Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 16, 2015

Chemical databases: curation or integration by user-defined equivalence?

Filed under: Cheminformatics,Chemistry,InChl,Subject Identity — Patrick Durusau @ 2:52 pm

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.

Abstract:

There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why chemical databases record different structures for the same chemical. One problem is that there is no authoritative source for chemical structures, so upon publication authors publish those aspects most relevant to their interest. Or they publish images rather than machine-readable representations of a chemical. To say nothing of the usual antics with simple names and their confusions. Software limitations, business rules and other factors add still more to the multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.
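To make the abstract's "various levels of similarity" concrete, here is a minimal sketch of my own, not the authors' method. It assumes standard InChI strings in which the formula, /c and /h layers carry the connectivity, and the helper names (split_layers, match_level) are hypothetical. The sample InChIs are for L-alanine, D-alanine and glycine.

```python
# Sketch only: compares two standard InChI strings at two levels.
# Assumes each layer letter appears once (no fixed-H /f layer).

CONNECTIVITY_LAYERS = ("formula", "c", "h")


def split_layers(inchi):
    """Split an InChI string into version, formula and lettered layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    layers = {
        "version": parts[0],
        "formula": parts[1] if len(parts) > 1 else "",
    }
    for part in parts[2:]:
        if part:  # skip empty segments
            layers[part[0]] = part[1:]
    return layers


def match_level(a, b):
    """Return 'exact', 'same connectivity' or 'different'."""
    if a == b:
        return "exact"
    la, lb = split_layers(a), split_layers(b)
    if all(la.get(k) == lb.get(k) for k in CONNECTIVITY_LAYERS):
        return "same connectivity"
    return "different"


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

print(match_level(L_ALANINE, D_ALANINE))
print(match_level(L_ALANINE, GLYCINE))
```

The two alanine enantiomers differ only in their stereo layers, so they link at the connectivity level even though the strings are not identical. Which level counts as "equivalent" is a choice left to the user.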

The authors openly ask if it is time to ask users for their assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?

I first saw this in a tweet by John P. Overington.

April 19, 2014

On InChI and evaluating the quality of cross-reference links

Filed under: Cheminformatics,Identifiers,InChl,Topic Maps — Patrick Durusau @ 10:33 am

On InChI and evaluating the quality of cross-reference links by Jakub Galgonek and Jiří Vondrášek. (Journal of Cheminformatics 2014, 6:15 doi:10.1186/1758-2946-6-15)

Abstract:

Background

There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

Results

We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

Conclusions

We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.

Another use case for topic maps, don’t you think?

Rather than a mapping keyed on recognition of a single identifier, have the mapping keyed to the recognition of several key/value pairs.

I don’t think there is an abstract answer as to the optimum number of key/value pairs that must match for identification. Experience would be a much better guide.
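A rough sketch of what keying a mapping on several key/value pairs might look like. Everything here is an assumption for illustration: the property names, the sample records and the threshold of two matching pairs are mine, not drawn from the paper or from any topic map standard.

```python
# Sketch only: treat two records as the same subject when they share
# at least `min_matches` identical key/value pairs.

def shared_pairs(a, b):
    """Return the set of (key, value) pairs present in both records."""
    return set(a.items()) & set(b.items())


def same_subject(a, b, min_matches=2):
    """True if the records agree on at least min_matches key/value pairs."""
    return len(shared_pairs(a, b)) >= min_matches


# Hypothetical entries for aspirin in three different databases.
db_a = {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        "cas": "50-78-2",
        "name": "aspirin"}
db_b = {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        "cas": "50-78-2",
        "name": "acetylsalicylic acid"}
db_c = {"name": "aspirin",
        "formula": "C9H8O4"}

print(same_subject(db_a, db_b))
print(same_subject(db_a, db_c))
print(same_subject(db_a, db_c, min_matches=1))
```

With the threshold at two, db_a and db_b merge on InChIKey plus CAS number, while db_a and db_c share only a name and stay apart. Lowering the threshold to one merges them, which is exactly the knob that experience would have to tune.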

February 14, 2013

InChI in the wild: An Assessment of InChIKey searching in Google

Filed under: Bioinformatics,Cheminformatics,InChl — Patrick Durusau @ 8:19 pm

InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)

Abstract:

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. 
The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.

An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.

How else would you use identifiers “in the wild?”
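One answer, following the abstract's remark about the "skeleton layer" of the IK: group keys by their first block. A minimal sketch, assuming the standard 27-character InChIKey layout (14 letters, hyphen, 10 letters, hyphen, 1 letter). The function names are hypothetical, and the sample keys (two alanine enantiomers and aspirin) are illustrative inputs worth checking against a database before you rely on them.

```python
# Sketch only: group InChIKeys by their 14-letter first block so that
# stereoisomers with the same connectivity land in the same bucket.
import re
from collections import defaultdict

INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton(key):
    """Return the first (connectivity) block of an InChIKey."""
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError("not a well-formed InChIKey: %r" % key)
    return key[:14]


def group_by_skeleton(keys):
    """Map each first block to the list of full keys that share it."""
    groups = defaultdict(list)
    for key in keys:
        groups[skeleton(key)].append(key)
    return dict(groups)


keys = [
    "QNAYBMKLOCPYGJ-REOHZLNVSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
]

for block, members in group_by_skeleton(keys).items():
    print(block, len(members))
```

Searching on the first block alone trades precision for recall: it pulls in isomers, which is the "isomer capture" the abstract describes, at the cost of hits that are not the exact compound.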

December 16, 2012

The InChI and its influence on the domain of chemical information

Filed under: Cheminformatics,InChl — Patrick Durusau @ 8:26 pm

The InChI and its influence on the domain of chemical information by Bailey Fallon.

From the post:

A thematic series on the IUPAC International Chemical Identifier (InChI) and Its Influence on the Domain of Chemical Information has just seen its first articles published in Journal of Cheminformatics.

The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, allowing it to be used for structure searching in databases and on the web. This thematic issue, edited by Dr Antony Williams at the Royal Society of Chemistry, aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

Certainly should command your attention if you are in cheminformatics.

But also if you want to duplicate its success.
