Archive for the ‘Cheminformatics’ Category

Cheminformatics Supplements

Thursday, March 28th, 2013

Cheminformatics Supplements

I ran across a pointer today to abstracts for the 8th German Conference on Chemoinformatics: 26th CIC-Workshop, from Chemistry Central.

I will pull several of the abstracts out for fuller treatment, but whatever I choose, I am bound to miss the one abstract of most interest to you.

Moreover, the link at the top of this post takes you to all the “supplements” from Chemistry Central.

I am sure you will find a wealth of information.

Introducing tags to Journal of Cheminformatics

Sunday, March 24th, 2013

Introducing tags to Journal of Cheminformatics by Bailey Fallon.

From the post:

Journal of Cheminformatics will now be “tagging” its publications, allowing articles related by common themes to be linked together.

Where an article has been tagged, readers will be able to access all other articles that share the same tag via a link at the right hand side of the HTML, making it easier to find related content within the journal.

This functionality has been launched for three resources that appear frequently in Journal of Cheminformatics and we will continue to add tags when relevant.

  • Open Babel: Open Babel is an open source chemical toolbox that interconverts over 110 chemical data formats. The first paper describing the features and implementation of Open Babel appeared in Journal of Cheminformatics in 2011, and this tag links it with a number of other papers that use the toolkit
  • PubChem: PubChem is an open archive for the biological activities of small molecules, which provides search and analysis tools to assist users in locating desired information. This tag amalgamates the papers published in the PubChem3D thematic series with other papers reporting applications and developments of PubChem
  • InChI: The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, making it a valuable tool for cheminformaticians, and this tag links a number of papers in Journal of Cheminformatics that rely on its use

Tagging is not sophisticated authoring of associations, but carefully done, it can collate information resources for users.

On export to a topic map application, implied roles could be made explicit, assuming the original tagging was consistent.
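That export step could be sketched like this (the article records, tag names, and association structure below are all hypothetical, for illustration only):

```python
from collections import defaultdict

# Hypothetical article records carrying journal-style tags.
articles = [
    {"id": "art-1", "title": "Open Babel: An open chemical toolbox", "tags": ["Open Babel"]},
    {"id": "art-2", "title": "PubChem3D: conformer ensemble accuracy", "tags": ["PubChem"]},
    {"id": "art-3", "title": "InChI in the wild", "tags": ["InChI", "PubChem"]},
]

def build_tag_index(articles):
    """Collate articles under each tag, as the journal's sidebar link does."""
    index = defaultdict(list)
    for article in articles:
        for tag in article["tags"]:
            index[tag].append(article["id"])
    return dict(index)

def export_associations(index, role="resource-described"):
    """On export, make the role implied by tagging explicit in each association."""
    return [
        {"type": "tagged-with", "tag": tag, "member": member, "role": role}
        for tag, members in index.items()
        for member in members
    ]

index = build_tag_index(articles)
print(index["PubChem"])                      # articles sharing the PubChem tag
print(export_associations(index)[0]["type"])
```

Consistent tagging is what makes the exported role assignment safe: the same tag must mean the same thing across all tagged articles.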

Curating Inorganics? No. (ChEMBL)

Monday, March 18th, 2013

The results are in – inorganics are out!

From the ChEMBL-og blog, which “covers the activities of the Computational Chemical Biology Group at the EMBL-EBI in Hinxton.”

From the post:

A few weeks ago we ran a small poll on how we should deal with inorganic molecules – not just simple sodium salts, but things like organoplatinums, and other compounds with dative bonds, unusual electronic states, etc. The results from you were clear, there was little interest in having a lot of our curation time spent on these. We will continue to collect structures from the source journals, and they will be in the full database, but we won’t try and curate the structures, or display them in the interface. They will be appropriately flagged, and nothing will get lost. So there it is, democracy in action.

So for ChEMBL 16 expect fewer issues when you try and load our structures in your own pipelines and systems.

Just an FYI that inorganic compounds are not being curated at ChEMBL.

If you decide to undertake such work, contacting ChEMBL to coordinate collection, etc., would be a good first step.

Crowdsourced Chemistry… [Documents vs. Data]

Monday, March 18th, 2013

Crowdsourced Chemistry Why Online Chemistry Data Needs Your Help by Antony Williams. (video)

From the description:

This is the Ignite talk that I gave at ScienceOnline2010 #sci010 in the Research Triangle Park in North Carolina on January 16th 2010. This was supposed to be a 5 minute talk highlighting the quality of chemistry data on the internet. Ok, it was a little tongue in cheek because it was an after dinner talk and late at night but the data are real, the problem is real and the need for data curation of chemistry data online is real. On ChemSpider we have provided a platform to deposit and curate data. Other videos will show that in the future.

Great demonstration of the need for curation in chemistry.

And of the impact that re-usable information can have on the quality of information.

The errors in chemical descriptions you see in this video could be corrected in:

  • In an article.
  • In a monograph.
  • In a webpage.
  • In an online resource that can be incorporated by reference.

Which one do you think would propagate the corrected information more quickly?

Documents are a great way to convey information to a reader.

They are an incredibly poor way to store/transmit information.

Every reader has to extract the information in a document for themselves.

Not to mention that a document’s data is fixed, unless it incorporates information by reference.

Funny isn’t it? We are still storing data as we did when clay tablets were the medium of choice.

Isn’t it time we separated presentation (documents) from storage/transmission (data)?
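A minimal sketch of that separation (compound name and formula are stated facts; the rendering function is invented for illustration): store the correctable fact once as data, and generate the document from it on demand.

```python
import json

# Store the correctable fact once, as data...
record = {"compound": "atorvastatin", "formula": "C33H35FN2O5"}
stored = json.dumps(record)  # the storage/transmission form

def render_html(data):
    """...and generate the presentation (the document) from it on demand."""
    d = json.loads(data)
    return f"<p>{d['compound']} has formula {d['formula']}</p>"

# A correction made to the stored data propagates to every
# document rendered afterwards, without editing any document.
print(render_html(stored))
```

Readers extract nothing by hand: fix the record, and every downstream rendering carries the fix.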

Using molecular networks to assess molecular similarity

Friday, February 15th, 2013

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto other, less well known domains.
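One way to keep similarity tests documented and domain specific is a simple registry, one test per domain (the domains, functions, and threshold below are illustrative only, not a proposal for any particular system):

```python
similarity_tests = {}

def register(domain):
    """Record a documented, domain-specific similarity test."""
    def wrap(fn):
        similarity_tests[domain] = fn
        return fn
    return wrap

@register("strings")
def bigram_jaccard(a, b):
    """String-domain similarity: Jaccard overlap of character bigrams."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

@register("formulas")
def same_composition(a, b):
    """Chemistry-flavoured toy test: identical multiset of formula characters."""
    return 1.0 if sorted(a) == sorted(b) else 0.0

def similar(domain, a, b, threshold=0.5):
    """There is no universal test: the domain chooses the measure."""
    return similarity_tests[domain](a, b) >= threshold

print(similar("strings", "imidazole", "imidazoles"))
print(similar("formulas", "C2H6O", "C2H6O"))
```

The registry itself is the documentation hook: each entry names its domain and carries its own definition of sameness.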

InChI in the wild: An Assessment of InChIKey searching in Google

Thursday, February 14th, 2013

InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)

Abstract:

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. 
The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.

An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.

How else would you use identifiers “in the wild?”
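The isomer-capture trick the abstract mentions relies on the InChIKey layout: the first 14-character block hashes the molecular skeleton (connectivity), while the second block encodes stereochemistry and other layers. A sketch, using made-up keys rather than real ones:

```python
def skeleton_layer(inchikey):
    """Return the 14-character skeleton (connectivity) block of an InChIKey.

    Isomers of the same skeleton share this block and differ only in the
    second block, so a web search on it alone captures the isomers too.
    """
    first, _rest = inchikey.split("-", 1)
    if len(first) != 14 or not first.isalpha():
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return first

# Hypothetical keys: same skeleton block, different stereochemistry block.
key_a = "ABCDEFGHIJKLMN-OPQRSTUVWX-N"
key_b = "ABCDEFGHIJKLMN-ZYXWVUTSRQ-N"

# A skeleton-only query would retrieve both isomers.
print(skeleton_layer(key_a) == skeleton_layer(key_b))  # True
```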

Chemical datuments as scientific enablers

Friday, January 25th, 2013

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)

Abstract:

This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or not, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how many times must we repeat the same mappings, separately?).

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.

UniChem…[How Much Precision Can You Afford?]

Thursday, January 17th, 2013

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)

Abstract:

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn’t an academic exercise in elegance and completeness.

Its users have immediate need for the mapping data, and if it is not quite right, they are in the best position to notice and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.

Suggestive that evolving mappings may be an appropriate paradigm for topic maps.
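The core of the UniChem design can be sketched in a few lines: every source maps its identifiers to a Standard InChI, all cross-source links fall out of that single table, and superseded assignments are archived rather than deleted. (The source names and rows below are illustrative; this is not the UniChem API.)

```python
assignments = []  # rows: source, id, inchi, current flag

def load(source, identifier, inchi):
    """Load a new data release: supersede, but archive, older assignments."""
    for row in assignments:
        if row["source"] == source and row["id"] == identifier:
            row["current"] = False  # obsolete, yet nothing gets lost
    assignments.append({"source": source, "id": identifier,
                        "inchi": inchi, "current": True})

def links_for(source, identifier):
    """All current cross-source links, computed from the shared InChIs."""
    inchis = {r["inchi"] for r in assignments
              if r["source"] == source and r["id"] == identifier and r["current"]}
    return sorted((r["source"], r["id"]) for r in assignments
                  if r["inchi"] in inchis and r["current"]
                  and (r["source"], r["id"]) != (source, identifier))

load("chembl", "CHEMBL25", "InChI=1S/example")  # hypothetical rows
load("pdbe", "AIN", "InChI=1S/example")
print(links_for("chembl", "CHEMBL25"))  # [('pdbe', 'AIN')]
```

Note the economy: each source maintains one mapping (to the InChI), and every pairwise link between sources is derived, never hand-curated.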

chemf: A purely functional chemistry toolkit

Thursday, January 17th, 2013

chemf: A purely functional chemistry toolkit by Stefan Höck and Rainer Riedl. (Journal of Cheminformatics 2012, 4:38 doi:10.1186/1758-2946-4-38)

Abstract:

Background

Although programming in a type-safe and referentially transparent style offers several advantages over working with mutable data structures and side effects, this style of programming has not seen much use in chemistry-related software. Since functional programming languages were designed with referential transparency in mind, these languages offer a lot of support when writing immutable data structures and side-effects free code. We therefore started implementing our own toolkit based on the above programming paradigms in a modern, versatile programming language.

Results

We present our initial results with functional programming in chemistry by first describing an immutable data structure for molecular graphs together with a couple of simple algorithms to calculate basic molecular properties before writing a complete SMILES parser in accordance with the OpenSMILES specification. Along the way we show how to deal with input validation, error handling, bulk operations, and parallelization in a purely functional way. At the end we also analyze and improve our algorithms and data structures in terms of performance and compare it to existing toolkits both object-oriented and purely functional. All code was written in Scala, a modern multi-paradigm programming language with a strong support for functional programming and a highly sophisticated type system.

Conclusions

We have successfully made the first important steps towards a purely functional chemistry toolkit. The data structures and algorithms presented in this article perform well while at the same time they can be safely used in parallelized applications, such as computer aided drug design experiments, without further adjustments. This stands in contrast to existing object-oriented toolkits where thread safety of data structures and algorithms is a deliberate design decision that can be hard to implement. Finally, the level of type-safety achieved by Scala highly increased the reliability of our code as well as the productivity of the programmers involved in this project.

Another vote in favor of functional programming as a path to parallel processing.

Can the next step, identity transparency*, be far behind?

*Identity transparency: where any identification of an entity can be replaced with another identification of the same entity.
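The immutable-molecular-graph idea translates directly, here sketched in Python rather than the paper’s Scala (the structure and method names are mine, not chemf’s):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Molecule:
    """An immutable molecular graph: 'edits' return new molecules, so
    instances can be shared across threads without locks."""
    atoms: tuple        # element symbols, indexed by position
    bonds: frozenset    # frozenset({i, j}) pairs of atom indices

    def add_atom(self, element, bonded_to):
        """Return a NEW molecule with the atom attached; self is untouched."""
        i = len(self.atoms)
        return Molecule(self.atoms + (element,),
                        self.bonds | {frozenset({i, bonded_to})})

    def degree(self, i):
        return sum(1 for bond in self.bonds if i in bond)

ethane = Molecule(("C", "C"), frozenset({frozenset({0, 1})}))
ethanol = ethane.add_atom("O", bonded_to=1)

print(ethane.atoms)       # ('C', 'C')  -- unchanged by the "edit"
print(ethanol.degree(1))  # 2
```

Referential transparency is the whole point: because `ethane` can never change, any parallel computation holding it needs no defensive copies.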

PubChem3D: conformer ensemble accuracy

Sunday, January 13th, 2013

PubChem3D: conformer ensemble accuracy by Sunghwan Kim, Evan E Bolton and Stephen H Bryant. (Journal of Cheminformatics 2013, 5:1 doi:10.1186/1758-2946-5-1)

Abstract:

Background

PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.

Results

The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 +/- 0.12 A), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 +/- 0.03, 0.16 +/- 0.09, 0.09 +/- 0.05, and 0.15 +/- 0.09, respectively).

Conclusion

This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.
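For readers less familiar with the RMSD measure quoted above, a minimal sketch (coordinates invented; real use would superimpose the conformers first):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two conformers of one molecule,
    atoms listed in the same order (alignment assumed already done)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must have the same atom count")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Two toy 3-atom conformers, one shifted 0.18 A along x -- the average
# accuracy loss the paper reports after clustering.
a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(0.18, 0.0, 0.0), (1.68, 0.0, 0.0), (3.18, 0.0, 0.0)]
print(round(rmsd(a, b), 2))  # 0.18
```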

If I were to say, potential shapes of a subject, would that make the importance of this work clearer?

Wikipedia has this two-liner that may also help:

A macromolecule is usually flexible and dynamic. It can change its shape in response to changes in its environment or other factors; each possible shape is called a conformation, and a transition between them is called a conformational change. A macromolecular conformational change may be induced by many factors such as a change in temperature, pH, voltage, ion concentration, phosphorylation, or the binding of a ligand.

Subjects and the manner of their identification are a very deep and rewarding field of study.

An identification method in isolation is no better or worse than any other identification method.

Only your requirements (which are also subjects) can help with the process of choosing one or more identification methods over others.

Automated compound classification using a chemical ontology

Sunday, January 13th, 2013

Automated compound classification using a chemical ontology by Claudia Bobach, Timo Böhme, Ulf Laube, Anett Püschel and Lutz Weber. (Journal of Cheminformatics 2012, 4:40 doi:10.1186/1758-2946-4-40)

Abstract:

Background

Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds.

Results

In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships.

Conclusions

A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemistry structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.

While creating an ontology to assist with compound classification, the authors concede the literature contains much semantic diversity:

Chemists use a variety of expressions to create compound class terms from a specific compound name – for example “backbone”, “scaffold”, “derivative”, “compound class” are often used suffixes or “substituted” is a common prefix that generates a class term. Unfortunately, the meaning of different chemical class terms is often not defined precisely and their usage may differ significantly due to historic reasons and depending on the compound class. For example, 2-ethyl-imidazole 1 belongs without doubt to the class of compounds having a imidazole scaffold, backbone or being an imidazole derivative or substituted imidazole. In contrast, pregnane 2 illustrates a more complicated case – as in case of 2-ethyl-imidazole this compound could be considered a 17-ethyl-derivative of the androstane scaffold 3. However, this would suggest a wrong compound classification as pregnanes are not considered to be androstane derivatives – although 2 contains androstane 3 as a substructure (Figure 1). This particular, structurally illogical naming convention goes back to the fundamentally different biological activities of specific compounds with a pregnane or androstane backbone, resulting in the perception that androstanes and pregnanes do not show a parent–child relation but are rather sibling concepts at the same hierarchical level. Thus, any expert chemical ontology will appreciate this knowledge and the androstane compound class structural definition needs to contain a definition that any androstane shall NOT contain a carbon substitution at the C-17 position. (emphasis added)

Not that present day researchers would create a structurally illogical naming convention in the view of future researchers.
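The paper’s AND / OR / NOT rule logic can be illustrated with toy predicates (here a “molecule” is just a set of named substructures; the real system combines SMARTS expressions, and the feature names below are mine):

```python
def has(feature):
    """Toy substructure test; a real system would evaluate a SMARTS match."""
    return lambda mol: feature in mol

def AND(*rules): return lambda mol: all(r(mol) for r in rules)
def OR(*rules):  return lambda mol: any(r(mol) for r in rules)
def NOT(rule):   return lambda mol: not rule(mol)

# The androstane/pregnane case from the quoted passage: an androstane
# must contain the androstane skeleton but NOT carry a carbon
# substituent at C-17, or the class would wrongly swallow the pregnanes.
is_androstane = AND(has("androstane-skeleton"),
                    NOT(has("C17-carbon-substituent")))

androstane = {"androstane-skeleton"}
pregnane = {"androstane-skeleton", "C17-carbon-substituent"}

print(is_androstane(androstane))  # True
print(is_androstane(pregnane))    # False -- a sibling class, not a child
```

The NOT clause is exactly where expert knowledge overrides the substructure relation: containment of the skeleton alone would give the wrong hierarchy.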

The IUPAC International Chemical Identifier (InChI)….

Saturday, January 5th, 2013

The IUPAC International Chemical Identifier (InChI) and its influence on the domain of chemical information edited by Dr. Anthony Williams.

From the webpage:

The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

If you are interested in chemistry/cheminformatics or in the development and use of identifiers, this is an issue to not miss!

You will find:

InChIKey collision resistance: an experimental testing by Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, Steve Heller.

Consistency of systematic chemical identifiers within and between small-molecule databases by Saber A Akhondi, Jan A Kors, Sorel Muresan.

InChI: a user’s perspective by Steven M Bachrach.

InChI: connecting and navigating chemistry by Antony J Williams.

I particularly enjoyed Steven Bachrach’s comment:

It is important to recognize that in no way does InChI replace or make outmoded any other chemical identifier. A company that has developed their own registry system or one that uses one of the many other identifiers, like a MOLfile [13], can continue to use their internal system. Adding the InChI to their system provides a means for connecting to external resources in a simple fashion, without exposing any of their own internal technologies.

Or to put it differently, InChI increased the value of existing chemical identifiers.

How’s that for a recipe for adoption?

Royal Society of Chemistry (RSC) – National Chemical Database Service

Friday, December 28th, 2012

From the homepage: (Goes live: 2nd January 2013)

National Chemical Database Service

The RSC will be operating the EPSRC National Chemical Database Service from 2013-2017

What is the RSC’s vision for the Service?

We intend to build the Service for the future – to develop a chemistry data repository for UK academia, and to build tools, models and services on this data store to increase the value and impact of researchers’ funded work. We will continue to develop this data store through the lifetime of the contract period and look forward to working with the community to make this a world-leading exemplar of the value of research data availability.

The Service will also offer access to a suite of commercial databases and services. While there will be some overlap with currently provided databases popular with the user community we will deliver new data and services and optimize the offering based on user feedback.

When will the Service be available?

The Service will start on 2nd January 2013, and will be available at cds.rsc.org.

The database services we are working to have available at launch are the Cambridge Structural Database, ACD/ILab and Accelrys’ Available Chemicals Directory. The Service will also include integrated access to the RSC’s award winning ChemSpider database. As ‘live’ dates for other services become clear, they will appear here.

See also: Initial Demonstrations of the Interactive Laboratory Service as part of the Chemical Database Service

and,

Initial Demonstration of the Integration to the Accelrys Available Chemicals Directory Web Service

I just looked at the demos but was particularly impressed with their handling of identifiers. Really impressed. There are lessons here for other information services.

BTW, I did have to hunt to discover that RSC = Royal Society of Chemistry. 😉

The InChI and its influence on the domain of chemical information

Sunday, December 16th, 2012

The InChI and its influence on the domain of chemical information by Bailey Fallon.

From the post:

A thematic series on the IUPAC International Chemical Identifier (InChI) and Its Influence on the Domain of Chemical Information has just seen its first articles published in Journal of Cheminformatics.

The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, allowing it to be used for structure searching in databases and on the web. This thematic issue, edited by Dr Antony Williams at the Royal Society of Chemistry, aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

Certainly should command your attention if you are in cheminformatics.

But also if you want to duplicate its success.
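Part of that success is that a canonical identifier makes cross-database merging almost trivial: same InChI, same substance. A minimal Python sketch of the idea, using InChI strings as merge keys (the records and sources here are invented for illustration):

```python
# Two hypothetical database extracts keyed on InChI strings.
# Records with the same InChI describe the same substance and merge.
db_a = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": {"name": "ethanol", "source": "A"},
    "InChI=1S/CH4O/c1-2/h2H,1H3": {"name": "methanol", "source": "A"},
}
db_b = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": {"name": "ethyl alcohol", "source": "B"},
}

merged = {}
for db in (db_a, db_b):
    for inchi, record in db.items():
        # Collect every record under its canonical key.
        merged.setdefault(inchi, []).append(record)

# Three records collapse to two substances; both names survive on one key.
print(len(merged))  # 2
```

Note that the names disagree ("ethanol" vs. "ethyl alcohol") but the identifier settles the question of identity, which is exactly the property a topic map wants in a subject identifier.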

Open Babel: One year on

Friday, December 7th, 2012

Open Babel: One year on by Bailey Fallon.

From the post:

Just over a year ago, Journal of Cheminformatics published a paper describing the open source chemical toolbox, Open Babel. Despite almost 10 years as an independent project, a description of the features and implementation of Open Babel had never been formally published. However, in the 14 months since publication, the Open Babel paper has quickly become one of the most influential articles in Journal of Cheminformatics. It is the second most cited article in the journal according to Thomson Reuters Web of Science, and is amongst the most widely read, with close to 10 000 accesses. The software itself has been downloaded over 40 000 times in the last year alone.

Open Babel attempts to solve a common problem in cheminformatics – the need to convert between different chemical structure formats. It offers a solution by allowing anyone to search, convert, analyze, or store data in over 110 formats covering molecular modeling, chemistry, solid-state materials, biochemistry, and related areas.

Introductory training guide to Open Babel (by Noel O’Boyle).

That's impressive!

But you need to remember it wasn’t that many years ago when commercial conversion software offered more than 300 source and target formats.

Still, worth taking a deep look to see if there are useful lessons for topic maps.
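One lesson worth noting is the hub-and-spoke design that makes 110+ formats tractable: each format needs only one reader and one writer against a common internal model, so N formats cost roughly 2N adapters rather than N² pairwise converters. A toy Python sketch of the idea (both "formats" here are invented stand-ins):

```python
# Hub-and-spoke conversion: every format is read into, and written from,
# a common internal model (here, a list of (symbol, (x, y, z)) tuples).

def read_xyzish(text):
    # Toy reader: "symbol x y z" per line.
    atoms = []
    for line in text.strip().splitlines():
        sym, x, y, z = line.split()
        atoms.append((sym, (float(x), float(y), float(z))))
    return atoms

def write_csvish(atoms):
    # Toy writer: comma-separated fields.
    return "\n".join(f"{s},{x},{y},{z}" for s, (x, y, z) in atoms)

READERS = {"xyzish": read_xyzish}
WRITERS = {"csvish": write_csvish}

def convert(text, src, dst):
    # Any registered source format converts to any registered target.
    return WRITERS[dst](READERS[src](text))

out = convert("O 0.0 0.0 0.0\nH 0.96 0.0 0.0", "xyzish", "csvish")
```

Adding a new format means adding one entry to each table, not a converter for every existing format.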

RMol: …SD/Molfile structure information into R Objects

Saturday, November 17th, 2012

RMol: A Toolset for Transforming SD/Molfile structure information into R Objects by Martin Grabner, Kurt Varmuza and Matthias Dehmer.

Abstract:

Background

The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As demonstrated frequently, a well designed format to encode chemical structures and structure-related information of organic compounds is the Molfile format. But when it comes to use modern programming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of the most powerful free languages lacks tools to process Molfile data collections and import molecular network data into R.

Results

We design an R object which allows a lossless information mapping of structural information from Molfiles into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large data sets by storing R objects in bz-compressed serialized files instead of employing RData files.

Conclusions

By design, RMol is an R tool set without dependencies on other libraries or programming languages. It is useful to integrate into pipelines for serialized batch analysis by using network data and, therefore, helps to process sdf-data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from http://sourceforge.net/p/rmol-toolset.

Important work, not the least because of the explosion of interest in bio/cheminformatics.

If I understand the rationale for the software, it:

  1. enables use of existing R tools for graph/network analysis
  2. fits well into workflows with serialized pipelines
  3. reduces dependencies by extracting SD-File information
  4. avoids repetitive transformations by storing chemical and molecular network information in R objects

All of which are true but I have a nagging concern about the need for transformation.

Knowing the structure of Molfiles and the requirements of R tools for graph/network analysis, how are the results of transformation different from R tools viewing Molfiles “as if” they were composed of R objects?

The mapping is already well known because that is what RMol uses to create the results of transformation. Moreover, for any particular use, more data may be transformed than is required for the analysis at hand.

Not to take anything away from very useful work but the days of transformation of data are numbered. As data sets grow in size, there will be fewer and fewer places to store a “transformed” data set.
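The "as if" view argued for above can be sketched in Python: a thin wrapper that parses V2000 Molfile fields on demand instead of transforming the whole file into a separate object model. A hypothetical, minimal reader that ignores most of the format:

```python
class MolfileView:
    """Read-only 'view' of a V2000 Molfile: fields are parsed on access
    rather than transformed up front into a separate object model.
    Hypothetical and minimal; real Molfiles carry many more blocks."""

    def __init__(self, text):
        self._lines = text.splitlines()

    @property
    def counts(self):
        # Counts line is the 4th line; atom and bond counts are
        # fixed-width 3-character fields.
        line = self._lines[3]
        return int(line[0:3]), int(line[3:6])

    def atom(self, i):
        # Atom block starts on line 5: "x y z symbol ..."
        fields = self._lines[4 + i].split()
        return fields[3], tuple(float(c) for c in fields[:3])

sample = """water
  toy

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0
    0.9600    0.0000    0.0000 H   0  0
   -0.2400    0.9300    0.0000 H   0  0
"""
mv = MolfileView(sample)
```

Nothing is copied until it is asked for, which is the point: the mapping lives in code, not in a second stored data set.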

BTW, pay particular attention to the bibliography in this paper. Numerous references to follow if you are interested in this area.

Cheminformatics

Wednesday, November 14th, 2012

Cheminformatics by Joerg Kurt Wegner, Aaron Sterling, Rajarshi Guha, Andreas Bender, Jean-Loup Faulon, Janna Hastings, Noel O’Boyle, John Overington, Herman Van Vlijmen, Egon Willighagen.

Key Insights:

  • Molecules with similar physical structure tend to have similar chemical properties.
  • Open-source chemistry programs and open-access molecular databases allow interdisciplinary research opportunities that did not previously exist.
  • Cheminformatics combines biological, chemical, pharmaceutical, and drug patient information to address large-scale data mining, curation and visualization challenges.

Semantic impedance issues abound in cheminformatics.

This article is a very good overview of cheminformatics and an introduction to its many challenges.

8th German Conference on Chemoinformatics [GCC 2012]

Thursday, October 25th, 2012

8th German Conference on Chemoinformatics [GCC 2012]

From the post:

The 8th German Conference on Chemoinformatics takes place in Goslar, Germany next month, and we are pleased to announce that once again, Journal of Cheminformatics will be the official publishing partner and poster session sponsor.

The conference runs from November 11th–13th and covers a wide range of topics around cheminformatics and chemical information including: Chemoinformatics and Drug Discovery; Molecular Modelling; Chemical Information, Patents and Databases; and Computational Material Science and Nanotechnology.

This will be the fourth year that Journal of Cheminformatics has been involved with the conference, and abstracts from the previous three meetings are freely available via the journal website.

The prior meeting abstracts are a very rich source of materials that merit your attention.

mol2chemfig, a tool for rendering chemical structures… [Hard Copy Delivery of Topic Map Content]

Saturday, October 6th, 2012

mol2chemfig, a tool for rendering chemical structures from molfile or SMILES format to LaTeX code by Eric K Brefo-Mensah and Michael Palmer.

Abstract:

Displaying chemical structures in LaTeX documents currently requires either hand-coding of the structures using one of several LaTeX packages, or the inclusion of finished graphics files produced with an external drawing program. There is currently no software tool available to render the large number of structures available in molfile or SMILES format to LaTeX source code. We here present mol2chemfig, a Python program that provides this capability. Its output is written in the syntax defined by the chemfig TeX package, which allows for the flexible and concise description of chemical structures and reaction mechanisms. The program is freely available both through a web interface and for local installation on the user's computer. The code and accompanying documentation can be found at http://chimpsky.uwaterloo.ca/mol2chemfig.

Is there a presumption that topic map delivery systems are limited to computers?

Or that components in topic map interfaces have to snap, crackle or pop with every mouse-over?

While computers enable scalable topic map processing, processing should not be confused with delivery of topic map content.

If you are delivering information about chemical structures from a topic map into hard copy, you are likely to find this a useful tool.
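To give a sense of the target, the chemfig syntax describes structures as inline bond/branch chains. A hand-written sketch of the kind of code mol2chemfig emits for ethanol (illustrative, not actual tool output):

```latex
% Minimal chemfig document; \chemfig draws ethanol from an inline
% bond description (hand-written sketch, not mol2chemfig output).
\documentclass{article}
\usepackage{chemfig}
\begin{document}
\chemfig{CH_3-CH_2-OH}
\end{document}
```

Because the output is plain TeX source, it typesets identically on screen and in hard copy, which is what makes it attractive for print delivery.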

Towards a Universal SMILES representation…

Wednesday, September 19th, 2012

Towards a Universal SMILES representation – A standard method to generate canonical SMILES based on the InChI by Noel M O’Boyle. Journal of Cheminformatics 2012, 4:22 doi:10.1186/1758-2946-4-22

Abstract:

Background

There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

Results

I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.

Conclusions

The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain — such as the development of a standard aromatic model for SMILES — the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.

Noel notes much work remains to be done but being able to reliably compare the output of different toolkits sounds like a step in the right direction.

Mining Chemical Libraries with “Screening Assistant 2”

Sunday, September 9th, 2012

Mining Chemical Libraries with “Screening Assistant 2” by Vincent Le Guilloux, Alban Arrault, Lionel Colliandre, Stéphane Bourg, Philippe Vayer and Luc Morin-Allory (Journal of Cheminformatics 2012, 4:20 doi:10.1186/1758-2946-4-20)

Abstract:

Background

High-throughput screening assays have become the starting point of many drug discovery programs for large pharmaceutical companies as well as academic organisations. Despite the increasing throughput of screening technologies, the almost infinite chemical space remains out of reach, calling for tools dedicated to the analysis and selection of the compound collections intended to be screened.

Results

We present Screening Assistant 2 (SA2), an open-source JAVA software dedicated to the storage and analysis of small to very large chemical libraries. SA2 stores unique molecules in a MySQL database, and encapsulates several chemoinformatics methods, among which: providers management, interactive visualisation, scaffold analysis, diverse subset creation, descriptors calculation, substructure / SMARTS search, similarity search and filtering. We illustrate the use of SA2 by analysing the composition of a database of 15 million compounds collected from 73 providers, in terms of scaffolds, frameworks, and undesired properties as defined by recently proposed HTS SMARTS filters. We also show how the software can be used to create diverse libraries based on existing ones.

Conclusions

Screening Assistant 2 is a user-friendly, open-source software that can be used to manage collections of compounds and perform simple to advanced chemoinformatics analyses. Its modular design and growing documentation facilitate the addition of new functionalities, calling for contributions from the community. The software can be downloaded at http://sa2.sourceforge.net/.

And you thought you had “big data:”

Exploring biology through the activity of small molecules is an established paradigm used in drug research for several decades now [1, 2]. Today, a state of the art drug discovery program often begins with screening campaigns aiming at the identification of novel biologically active molecules. In the recent years, the rise of High Throughput Screening (HTS), combinatorial chemistry and the availability of large compound collections has led to a dramatic increase in the size of screening libraries, for both private companies and public organisations [3, 4]. Yet, despite these constantly increasing capabilities, various authors have stressed the need to design better instead of larger screening libraries [5–9]. Chemical space is indeed known to be almost infinite, and selecting the appropriate regions to explore for the problem at hand remains a challenging task. (emphasis added)

I would paraphrase the highlighted part to read:

Semantic space is known to be infinite, and selecting appropriate regions for the problem at hand remains a challenging task.

Or to put it differently, if I have a mapping strategy that works for clinical information systems or (U.S.) DoD contracting records or some other domain, that’s great!

I don’t need to look for a universal mapping solution.

Not to mention that I can produce results (and get paid) more quickly than waiting for a universal mapping solution.

Giant Network Links All Known Compounds and Reactions

Friday, August 24th, 2012

Let’s start with the “popular” version: Scientists Create Chemical ‘Brain’: Giant Network Links All Known Compounds and Reactions

From the post:

Northwestern University scientists have connected 250 years of organic chemical knowledge into one giant computer network — a chemical Google on steroids. This “immortal chemist” will never retire and take away its knowledge but instead will continue to learn, grow and share.

A decade in the making, the software optimizes syntheses of drug molecules and other important compounds, combines long (and expensive) syntheses of compounds into shorter and more economical routes and identifies suspicious chemical recipes that could lead to chemical weapons.

“I realized that if we could link all the known chemical compounds and reactions between them into one giant network, we could create not only a new repository of chemical methods but an entirely new knowledge platform where each chemical reaction ever performed and each compound ever made would give rise to a collective ‘chemical brain,'” said Bartosz A. Grzybowski, who led the work. “The brain then could be searched and analyzed with algorithms akin to those used in Google or telecom networks.”

Called Chematica, the network comprises some seven million chemicals connected by a similar number of reactions. A family of algorithms that searches and analyzes the network allows the chemist at his or her computer to easily tap into this vast compendium of chemical knowledge. And the system learns from experience, as more data and algorithms are added to its knowledge base.

Details and demonstrations of the system are published in three back-to-back papers in the Aug. 6 issue of the journal Angewandte Chemie.

Well, true enough, except for the “share” part. Chematica is in the process of being commercialized.

If you are interested in the non-“popular” version:

Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry (pages 7922–7927) by Dr. Chris M. Gothard, Dr. Siowling Soh, Nosheen A. Gothard, Dr. Bartlomiej Kowalczyk, Dr. Yanhu Wei, Dr. Bilge Baytekin and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202155.

Abstract:

Computational algorithms are used to identify sequences of reactions that can be performed in one pot. These predictions are based on over 86 000 chemical criteria by which the putative sequences are evaluated. The “raw” algorithmic output is then validated experimentally by performing multiple two-, three-, and even four-step sequences. These sequences “rewire” synthetic pathways around popular and/or important small molecules.

Parallel Optimization of Synthetic Pathways within the Network of Organic Chemistry (pages 7928–7932) by Dr. Mikołaj Kowalik, Dr. Chris M. Gothard, Aaron M. Drews, Nosheen A. Gothard, Alex Weckiewicz, Patrick E. Fuller, Prof. Bartosz A. Grzybowski and Prof. Kyle J. M. Bishop. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202209.

Abstract:

Finding a needle in a haystack: The number of possible synthetic pathways leading to the desired target of a synthesis can be astronomical (10^19 within five synthetic steps). Algorithms are described that navigate through the entire known chemical-synthetic knowledge to identify optimal synthetic pathways. Examples are provided to illustrate single-target optimization and parallel optimization of syntheses leading to multiple targets.

Chemical Network Algorithms for the Risk Assessment and Management of Chemical Threats (pages 7933–7937) by Patrick E. Fuller, Dr. Chris M. Gothard, Nosheen A. Gothard, Alex Weckiewicz and Prof. Bartosz A. Grzybowski. Article first published online: 13 JUL 2012 | DOI: 10.1002/anie.201202210.

Abstract:

A network of chemical threats: Current regulatory protocols are insufficient to monitor and block many short-route syntheses of chemical weapons, including those that start from household products. Network searches combined with game-theory algorithms provide an effective means of identifying and eliminating chemical threats. (Picture: an algorithm-detected pathway that yields sarin (bright red node) in three steps from unregulated substances.)
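At their core, the pathway searches described in these abstracts reduce to graph search over a reaction network: compounds are nodes, reactions are edges, and a synthetic route is a path. A toy breadth-first sketch in Python (compound names are placeholders, not real chemistry):

```python
from collections import deque

# Toy reaction network: nodes are compounds, directed edges are
# known reactions from reactant to product. Names are placeholders.
reactions = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["target"],
    "E": [],
}

def shortest_route(start, goal):
    """Breadth-first search: returns a shortest synthetic route
    from start to goal as a list of compounds, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in reactions.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

route = shortest_route("A", "target")
```

Chematica's real algorithms are far richer (reaction conditions, costs, protecting groups, game-theoretic threat analysis), but the graph-search skeleton is the same.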

Do you see any potential semantic issues in such a network, arising as our understanding of reactions changes?

Recalling that semantics isn’t simply a question of yesterday, today and tomorrow but also of tomorrows, 10, 50, or 100 or more years from now.

We may fancy our present understanding as definitive, but it is just a fancy.

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem

Sunday, August 12th, 2012

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem by Weerapong Phadungsukanan, Markus Kraft, Joe A Townsend and Peter Murray-Rust (Journal of Cheminformatics 2012, 4:15 doi:10.1186/1758-2946-4-15)

Abstract (provisional):

This paper introduces a subdomain chemistry format for storing computational chemistry data called CompChem. It has been developed based on the design, concepts and methodologies of Chemical Markup Language (CML) by adding computational chemistry semantics on top of the CML Schema. The format allows a wide range of ab initio quantum chemistry calculations of individual molecules to be stored. These calculations include, for example, single point energy calculation, molecular geometry optimization, and vibrational frequency analysis. The paper also describes the supporting infrastructure, such as processing software, dictionaries, validation tools and database repository. In addition, some of the challenges and difficulties in developing common computational chemistry dictionaries are being discussed. The uses of CompChem are illustrated on two practical applications.

Important contribution if you are working with computational chemistry semantics.

Also important for its demonstration of the value of dictionaries and not trying to be all inclusive.

Integrate the data you have at hand and make allowance for the yet to be known.

Besides, there is always the next topic map that may consume the first with new merging rules.
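The dictionary mechanism works by pointing attributes at controlled vocabulary entries rather than hard-coding meaning into the markup. A hypothetical CML-style fragment, parsed with Python's ElementTree (the element names follow CML conventions; the dictRef value is illustrative):

```python
import xml.etree.ElementTree as ET

# Hypothetical CML-style fragment: the dictRef attribute points into a
# controlled dictionary entry, so the semantics live in the dictionary,
# not in the element itself. The dictRef value here is illustrative.
fragment = """
<module xmlns="http://www.xml-cml.org/schema">
  <scalar dictRef="cc:jobtype" dataType="xsd:string">geometry optimization</scalar>
</module>
"""

ns = {"cml": "http://www.xml-cml.org/schema"}
scalar = ET.fromstring(fragment).find("cml:scalar", ns)
```

New concepts can be added by extending the dictionary without touching the schema, which is the "allowance for the yet to be known" mentioned above.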

CompChem Convention http://www.xml-cml.org/convention/compchem

CompChem dictionary http://www.xml-cml.org/dictionary/compchem/

CompChem validation stylesheet https://bitbucket.org/wwmm/cml-specs

CMLValidator http://bitbucket.org/cml/cmllite-validator-code

Chemical Markup Language (CML) http://www.xml-cml.org

Molecules from scratch without the fiendish physics

Sunday, February 12th, 2012

Molecules from scratch without the fiendish physics by Lisa Grossman.

From the post:

But because the equation increases in complexity as more electrons and protons are introduced, exact solutions only exist for the simplest systems: the hydrogen atom, composed of one electron and one proton, and the hydrogen molecule, which has two electrons and two protons.

This complexity rules out the possibility of exactly predicting the properties of large molecules that might be useful for engineering or medicine. “It’s out of the question to solve the Schrödinger equation to arbitrary precision for, say, aspirin,” says von Lilienfeld.

So he and his colleagues bypassed the fiendish equation entirely and turned instead to a computer-science technique.

Machine learning is already widely used to find patterns in large data sets with complicated underlying rules, including stock market analysis, ecology and Amazon’s personalised book recommendations. An algorithm is fed examples (other shoppers who bought the book you’re looking at, for instance) and the computer uses them to predict an outcome (other books you might like). “In the same way, we learn from molecules and use them as previous examples to predict properties of new molecules,” says von Lilienfeld.

His team focused on a basic property: the energy tied up in all the bonds holding a molecule together, the atomisation energy. The team built a database of 7165 molecules with known atomisation energies and structures. The computer used 1000 of these to identify structural features that could predict the atomisation energies.

When the researchers tested the resulting algorithm on the remaining 6165 molecules, it produced atomisation energies within 1 per cent of the true value. That is comparable to the accuracy of mathematical approximations of the Schrödinger equation, which work but take longer to calculate as molecules get bigger (Physical Review Letters, DOI: 10.1103/PhysRevLett.108.058301). (emphasis added)

One way to look at this research is to say we have three avenues to discovering the properties of molecules:

  1. Formal logic – but would require far more knowledge than we have at the moment
  2. Schrödinger equation – but may be intractable for some molecules
  3. Knowledge-based approach – may be less precise than 1 & 2, but works now

A knowledge-based approach allows us to make progress now. Topic maps can be annotated with other methods, such as math or research results, up to and including formal logic.

The biggest difference with topic maps is that the information you wish to record or act upon is not restricted ahead of time.
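Stripped of the chemistry, the learning approach described above is nearest-neighbour regression over molecular descriptors: predict a property of a new molecule from its closest known examples. A toy Python sketch (the descriptor vectors and energies are invented for illustration, not real data):

```python
# Nearest-neighbour sketch of the learn-from-known-molecules idea:
# predict a property of a new molecule from its closest known examples.
# Descriptor vectors and energies below are invented for illustration.

known = [
    # (descriptor vector, atomisation energy)
    ((2.0, 6.0), -1.2),
    ((3.0, 8.0), -1.9),
    ((1.0, 4.0), -0.7),
]

def distance(a, b):
    # Euclidean distance between descriptor vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(desc, k=1):
    """Average the property over the k nearest known molecules."""
    nearest = sorted(known, key=lambda pair: distance(pair[0], desc))[:k]
    return sum(energy for _, energy in nearest) / k

estimate = predict((2.1, 6.2))
```

The published work uses a far more sophisticated descriptor (a "Coulomb matrix") and kernel regression, but the knowledge-based shape of the method is the same: no physics is solved at prediction time.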

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Thursday, February 9th, 2012

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB's built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB's built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through JavaScript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a simpler solution for calculating aggregate values than relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline for analysis of chess games. Rather than summing the number of games in which the move "e4" is the first move for White, links to all 696 games could be treated as occurrences of that subject. That would support discovery of the player of White as well as Black.

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)
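Underlying all three parts of the series is the Tanimoto coefficient itself, which for set-based fingerprints is just intersection over union. A pure-Python sketch:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints represented as sets of
    on-bit positions: size of intersection over size of union."""
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

# Toy fingerprints: sets of on-bit positions (invented for illustration).
a = {1, 2, 3, 7}
b = {2, 3, 7, 9}
similarity = tanimoto(a, b)
```

The articles' contribution is not the formula but where it runs: pushing this computation into the database, next to the compound data, instead of shipping fingerprints to the client.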

OSCAR4

Monday, December 19th, 2011

OSCAR4 Launch

From the webpage:

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup, or name-to-structure parsing using OPSIN[1] or with identifiers from the ChEBI (`Chemical Entities of Biological Interest’) ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare [3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

Archived videos from the launch are now online: http://sms.cam.ac.uk/collection/1130934

Just to put this into a topic map context, imagine that the annotation in question was placement in an association with mappings to other data, data that was held by your employer and leased to researchers.

Information extraction from chemical patents

Monday, December 19th, 2011

Information extraction from chemical patents by David M. Jessop.

Abstract:

The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature. Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use. PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye – 4444 reactions are extracted with a precision of 78% and recall of 64% with regards to determining the identity and amount of reactants employed and an accuracy of 92% with regards to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information. Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.
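Hearst patterns themselves are simple lexico-syntactic templates such as "X such as Y1, Y2 and Y3". A minimal regex sketch in Python (real systems, like the one described, use many more patterns plus manual validation; this is a toy approximation):

```python
import re

# Toy Hearst pattern for "X such as Y1, Y2 and Y3": group 1 is the
# hypernym, group 2 the comma/and-separated hyponym list.
PATTERN = re.compile(r"(\w+) such as (\w+(?:, \w+)*(?: and \w+)?)")

def hyponyms(text):
    """Return (hypernym, [hyponyms]) pairs found in the text."""
    results = []
    for hyper, tail in PATTERN.findall(text):
        results.append((hyper, re.split(r", | and ", tail)))
    return results

pairs = hyponyms("solvents such as benzene, toluene and xylene were used")
```

The thesis's reported precision figures (85.0% hypernyms, 94.3% relations) show why manual validation remains part of the pipeline: patterns this shallow inevitably pick up noise.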

Curious to see how the system would perform against U.S. Patent Office literature?

Perhaps more to the point, how would it compare to commercial chemical indexing services?

Always possible to duplicate what has already been done.

Curious what current systems, commercial or otherwise, are lacking that could be a value-add proposition?

How would you poll users? In what journals? What survey instruments or practices would you use?

IBM and Drug Companies Donate Data

Wednesday, December 14th, 2011

IBM Contributes Data to the National Institutes of Health to Speed Drug Discovery and Cancer Research Innovation

From the post:

In collaboration with AstraZeneca, Bristol-Myers Squibb, DuPont and Pfizer, IBM is providing a database of more than 2.4 million chemical compounds extracted from about 4.7 million patents and 11 million biomedical journal abstracts from 1976 to 2000. The announcement was made at an IBM forum on U.S. economic competitiveness in the 21st century, exploring how private sector innovations and investment can be more easily shared in the public domain.

Excellent news and kudos to IBM and its partners for making the information available!

Now it is up to you to find creative ways to explore, connect up, analyze the data across other information sets.

My first question would be what was mentioned besides chemicals in the biomedical journal abstracts? Care to make an association to represent that relationship?

Why? Well, for example, if you are exposed to raw benzene, a by-product of oil refining, it can produce symptoms that are nearly identical to those of leukemia. Where would you encounter such a substance? Well, try living in Nicaragua for more than a decade where every day the floors are cleaned with raw benzene. Of course, in the States, doctors don't check for exposure to banned substances. Cases like that.

BTW, the data is already up, see: PubChem. Follow the links to the interface and click on “structures.” Not my area but the chemical structures are interesting enough that I may have to get a chemistry book for Christmas so I can have some understanding of what I am seeing.

That is probably the best part of being interested in semantic integration: it cuts across all fields, and new discoveries await with every turn of the page.

Chemical Entity Semantic Specification:…(article)

Tuesday, August 23rd, 2011

Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration by Leonid L Chepelev and Michel Dumontier, Journal of Cheminformatics 2011, 3:20 doi:10.1186/1758-2946-3-20.

Abstract

Background
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.

Results
Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.

Conclusions
By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.

Project homepage: Chemical Entity Semantic Specification
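The integration claim in the abstract rests on the basic Semantic Web move: every fact becomes a (subject, predicate, object) triple, so records from disparate databases merge by simple union, with duplicate facts collapsing automatically. CHESS itself is specified in RDF/OWL; the sketch below is only a toy, plain-Python illustration of that triple idea, and the predicate names are hypothetical, not taken from CHESS.

```python
# Toy illustration of triple-based data integration: each fact is a
# (subject, predicate, object) tuple, so two sources describing the same
# entity can be merged with a set union. Predicate names are made up here.

def merge_sources(*sources):
    """Union triple sets from disparate sources; identical facts collapse."""
    merged = set()
    for triples in sources:
        merged.update(triples)
    return merged

# Two hypothetical "databases" with overlapping facts about benzene.
db_a = {
    ("benzene", "hasFormula", "C6H6"),
    ("benzene", "hasInChI", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
}
db_b = {
    ("benzene", "hasFormula", "C6H6"),       # same fact, stated twice
    ("benzene", "hasBoilingPointC", "80.1"),
}

kb = merge_sources(db_a, db_b)

def query(kb, predicate):
    """Return all (subject, object) pairs matching a predicate."""
    return {(s, o) for s, p, o in kb if p == predicate}
```

After the union, `kb` holds three distinct facts (the duplicate formula collapses), and `query(kb, "hasFormula")` finds benzene regardless of which source contributed the fact, which is the point of a common representation.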

Quantitative structure-activity relationship (QSAR)

Monday, August 15th, 2011

I ran across enough materials while researching AZOrange that I needed to make a separate post on QSAR:

The Cheminformatics and QSAR Society

An Introduction to QSAR Methodology by Allen B. Richon

Quantitative structure-activity relationship – Wikipedia

QSAR World

Of greatest interest to people involved in cheminformatics, toxicity testing, drug development, etc.
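For readers new to the acronym, the core QSAR idea is just a fitted quantitative relationship between a structural descriptor and a measured biological activity. The sketch below fits a one-descriptor least-squares line to invented data; real QSAR models use many descriptors and far more careful statistics, so treat this purely as an illustration.

```python
# Minimal sketch of a QSAR fit: activity = slope * descriptor + intercept,
# solved by ordinary least squares. The logP/activity values are invented.

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical series: logP (a lipophilicity descriptor) vs. activity.
logp = [1.0, 2.0, 3.0, 4.0]
activity = [2.1, 4.0, 6.1, 7.9]

slope, intercept = fit_line(logp, activity)

# Predict the activity of an untested analogue with logP = 5.0.
predicted = slope * 5.0 + intercept
```

The payoff is the last line: once the structure-activity relationship is fitted, you can rank candidate compounds before anyone synthesizes them.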

Subject identity cuts across every field.

Now that would be an ambitious and interesting book, “Subject Identity.” An edited volume with contributions from experts in a variety of fields.