Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 13, 2011

CAS Registry Number & The Semantic Web

Filed under: Cheminformatics,Identifiers,Indexing — Patrick Durusau @ 3:47 pm

CAS Registry Number

Another approach to the problem of identification: assign an arbitrary identifier for which you hold the key.

If you start early enough in a particular era, you can gain enough of an advantage to deter most competitors, particularly if you curate the professional literature so that you can provide effective searching based on your (and others') identifiers.

The similarity to the Semantic Web’s assignment of a URI to every subject is not accidental.

The main differences with the Semantic Web:

  1. Economically important activity was the focus of the project.
  2. A professional literature base with obvious value-add potential for research and production.
  3. Single-source curation of the identifiers (no whining at others to create them).
  4. Identification was offered where there was user demand to support the effort.

The Wiki page reports (in part):

CAS Registry Numbers are unique numerical identifiers assigned by the “Chemical Abstracts Service” to every chemical described in the open scientific literature (currently including those described from at least 1957 through the present) and including elements, isotopes, organic and inorganic compounds, organometallics, metals, alloys, coordination compounds, minerals, and salts; as well as standard mixtures, compounds, polymers; biological sequences including proteins & nucleic acids; nuclear particles, and nonstructurable materials (aka ‘UVCBs’, i.e., materials of Unknown, Variable Composition, or Biological origin). They are also referred to as CAS RNs, CAS Numbers, etc.

The Registry maintained by CAS is an authoritative collection of disclosed chemical substance information. Currently the CAS Registry identifies more than 56 million organic and inorganic substances and 62 million sequences, plus additional information about each substance; and the Registry is updated with approximately 12,000 additional new substances daily.

Historically, chemicals have been identified by a wide variety of synonyms. Frequently these are arcane and constructed according to regional naming conventions relating to chemical formulae, structures or origins. Well-known chemicals may additionally be known via multiple generic, historical, commercial, and/or black-market names.

PS: The index is now at 61+ million substances.
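
One well-documented property of these identifiers (not quoted above, but easy to verify): the final digit of a CAS RN is a checksum over the preceding digits, so a number can be sanity-checked locally even though only CAS can assign one. A minimal Python sketch:

```python
def cas_check_digit_ok(cas_rn: str) -> bool:
    """Validate the check digit of a CAS Registry Number, e.g. '7732-18-5' (water)."""
    digits = cas_rn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight the body digits 1, 2, 3, ... starting from the rightmost.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

assert cas_check_digit_ok("7732-18-5")  # water
assert cas_check_digit_ok("58-08-2")    # caffeine
```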

InChI – IUPAC International Chemical Identifier

Filed under: Cheminformatics,Identifiers — Patrick Durusau @ 3:47 pm

The Chemical Entity Semantic Specification was useful in pointing me towards the InChI – IUPAC International Chemical Identifier (Wiki page).

From the Wiki page:

The identifiers describe chemical substances in terms of layers of information — the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application.

InChIs differ from the widely used CAS registry numbers in three respects:

  • they are freely usable and non-proprietary;
  • they can be computed from structural information and do not have to be assigned by some organization;
  • most of the information in an InChI is human readable (with practice).

I like the compute-from-structural-information aspect. It reminds me of Eric Freese and his topic map example that calculated extended family relationships from parent/child and sibling relationships.
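
To make the computability concrete, here is a minimal sketch using RDKit (my choice of toolkit, not one the Wiki page prescribes). The same structural input always yields the same identifier:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol, described structurally
print(Chem.MolToInchi(mol))      # InChI=1S/C2H6O/c1-2-3/h3H,1-2H3
print(Chem.MolToInchiKey(mol))   # fixed-length hashed form of the same InChI
```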

What other areas would benefit from computable identifications, and how would you go about constructing them such that the same set of inputs results in the same identifier?

The Wiki page cites a number of other resources on chemical identification that will be useful if you are straying into work with chemical databases.

Chemical Entity Semantic Specification

Filed under: Cheminformatics,RDF,Semantic Web — Patrick Durusau @ 3:46 pm

Chemical Entity Semantic Specification

From the website:

The Chemical Entity Semantic Specification (CHESS) framework strives to provide a means of representing chemical data with the goal of facile chemical information federation and addressing increasingly rich and complex queries for biological, pharmaceutical, and synthetic chemistry applications. The principal emphasis of CHESS is data representation to assist in metabolic fate determination, synthetic pathway construction, and automatic chemical entity classification. With explicit semantic specification of reactions for example, CHESS allows the tracing of the mechanisms of chemical transformations on the level of individual atoms, bonds, functional groups, or molecules, as well as the individual “histories” of elements of chemical entities in a pathway. Further, the CHESS framework draws on CHEMINF and SIO ontologies to provide methods for specifying uncertainty, conformer-specific information, units, and circumstances for physical measurements at variable levels of granularity, permitting rich, cross-domain queries over this data. In addition to this, CHESS provides a set of specifications to address data federation through the adoption of unique, canonical identifiers for many classes of chemical entities.

An interesting project, but it appears to lack uptake.

As of 13 August 2011, I get nine (9) “hits” from a popular search engine on the name as a string.

Useful as a resource for existing ontologies and identification schemes.
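
To give the flavor of the identifier-plus-semantics idea, a minimal sketch in plain RDF via rdflib. The vocabulary is invented for illustration; these are not actual CHESS, CHEMINF, or SIO terms:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/chess-sketch/")  # hypothetical vocabulary
g = Graph()

# Canonical-identifier idea: key the entity on a computable identifier
# (here the InChIKey for ethanol) and hang semantic assertions off it.
ethanol = EX["LFQSCWFLJHTTHZ-UHFFFAOYSA-N"]
g.add((ethanol, EX.totalCharge, Literal(0)))
g.add((ethanol, EX.commonName, Literal("ethanol")))

print(g.serialize(format="turtle"))
```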

August 11, 2011

The joy of algorithms and NoSQL: a MongoDB example (part 2)

Filed under: Algorithms,Cheminformatics,MapReduce,MongoDB — Patrick Durusau @ 6:35 pm

The joy of algorithms and NoSQL: a MongoDB example (part 2)

From the post:

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered. The example code on the GitHub repository, for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000. Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases only slightly compared to the overhead of transferring the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to MongoDB’s built-in map-reduce functionality!
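
The screening trick the post alludes to rests on a simple bound: a query fingerprint with q bits set can only reach a Tanimoto of t against a compound with b bits set if q·t ≤ b ≤ q/t. A minimal pure-Python sketch (my own illustration, not the post's code):

```python
import math

def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprint bit sets."""
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def candidate_bit_counts(query_bits: int, target: float) -> range:
    # Only compounds whose bit count falls in [ceil(q*t), floor(q/t)]
    # can possibly reach the target Tanimoto.
    return range(math.ceil(query_bits * target),
                 math.floor(query_bits / target) + 1)

print(list(candidate_bit_counts(40, 0.8)))  # 32..50: a narrow candidate band
print(list(candidate_bit_counts(40, 0.6)))  # 24..66: lowering t widens it sharply
```

That widening of the candidate band is exactly why lowering the target from 0.8 to 0.6 multiplies the returned compounds tenfold, and why pushing the computation to the data pays off.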

Screening “…a million compounds in subsecond time” sounds useful in a topic map context.

August 7, 2011

The joy of algorithms and NoSQL: a MongoDB example

Filed under: Cheminformatics,Similarity — Patrick Durusau @ 7:05 pm

The joy of algorithms and NoSQL: a MongoDB example

From the post:

In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.

1. Molecular similarity theory

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This applies the hypothesis that similar compounds generally exhibit similar biological activities.) Unfortunately, finding substructures in chemical compounds is an NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientists solved this problem by introducing the notion of structural keys and fingerprints.
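
For a hands-on sense of fingerprints, a minimal RDKit sketch (my own illustration; the post's actual code lives in the Datablend GitHub repository). Hashed circular fingerprints stand in for the structural keys described above:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

ethanol, propanol, benzene = map(fingerprint, ["CCO", "CCCO", "c1ccccc1"])
# Structurally similar molecules share many bits; dissimilar ones share few.
print(DataStructs.TanimotoSimilarity(ethanol, propanol))  # relatively high
print(DataStructs.TanimotoSimilarity(ethanol, benzene))   # near zero
```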

If similarity is domain specific, what are the similarity measures in your favorite domain?

August 2, 2011

International QSAR Foundation

Filed under: Cheminformatics,QSAR — Patrick Durusau @ 7:52 pm

International QSAR Foundation

From the website:

The International QSAR Foundation is the only nonprofit research organization devoted solely to creating alternative methods for identifying chemical hazards without further laboratory testing.

We develop, implement and support new QSAR technologies for use in regulation, research and education or wherever testing animals with chemicals is now required. QSAR models predict chemical behavior directly from chemical structure and simulate adverse effects in cells, tissues and lab animals.

When combined with other alternative test methods, QSAR can minimize the need for animal tests while improving safe use of drugs and other chemicals. (emphasis added)
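
As a toy illustration of predicting behavior directly from structure (entirely my own sketch with made-up activity values, not a Foundation model): compute structure-derived descriptors and fit a linear model to them.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptors(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    # Simple structure-derived descriptors; real QSAR models use many more.
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]

# Hypothetical (SMILES, measured activity) training pairs.
train = [("CCO", 0.20), ("CCCCO", 0.45), ("CCCCCCO", 0.70), ("c1ccccc1O", 0.90)]
X = np.array([descriptors(s) for s, _ in train])
y = np.array([a for _, a in train])

# Ordinary least squares: activity ≈ X·w + b, then predict for a new structure.
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
new = np.array(descriptors("CCCO") + [1.0])
print(float(new @ w))  # predicted activity for 1-propanol
```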

Subject identification by predicted behavior anyone?

QSAR Toolbox

Filed under: Cheminformatics,QSAR — Patrick Durusau @ 7:52 pm

QSAR Toolbox

From the website:

The category approach used in the Toolbox:

  • Focuses on intrinsic properties of chemicals (mechanism or mode of action, (eco-)toxicological effects).
  • Allows for entire categories of chemicals to be assessed when only a few members are tested, saving costs and the need for testing on animals.
  • Enables robust hazard assessment through mechanistic comparisons without testing.

The QSAR Toolbox is software intended to be used by governments, the chemical industry and other stakeholders to fill gaps in (eco-)toxicity data needed for assessing the hazards of chemicals. The Toolbox incorporates information and tools from various sources into a logical workflow. Grouping chemicals into chemical categories is crucial to this workflow.
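
A toy sketch of the category (read-across) idea the workflow is built around (my own illustration with invented numbers, not Toolbox code): untested members of a category inherit an estimate from the tested members.

```python
from statistics import mean

# One chemical category; values are hypothetical measured toxicity endpoints.
category = {
    "member_A": 12.0,  # tested
    "member_B": 14.0,  # tested
    "member_C": None,  # untested: the data gap to fill
}

tested = [v for v in category.values() if v is not None]
read_across = mean(tested)
filled = {k: (v if v is not None else read_across) for k, v in category.items()}
print(filled)  # member_C receives a read-across estimate of 13.0
```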

December 27, 2010

Network Science – NetSci

Filed under: Bioinformatics,Cheminformatics,Drug Discovery,Pharmaceutical Research — Patrick Durusau @ 2:20 pm

Warning: NetSci has serious issues with broken links.

Network Science – NetSci: An Extensive Set of Resources for Science in Drug Discovery

From the website:

Welcome to the Network Science website. This site is dedicated to the topics of pharmaceutical research and the use of advanced techniques in the discovery of new therapeutic agents. We endeavor to provide a comprehensive look at the industry and the tools that are in use to speed drug discovery and development.

I stumbled across this website while looking for computational chemistry resources.

Pharmaceutical research is rich in topic map type issues, from mapping across the latest reported findings in journal literature to matching those identifications to results in computational software.

Questions:

  1. Develop a drug discovery account that illustrates how topic maps might or might not help in that process. (5-7 pages, citations)
  2. What benefits would a topic map bring to drug discovery and how would you illustrate those benefits for a grant application either to a pharmaceutical company or granting agency? (3-5 pages, citations)
  3. Where would you submit a grant application based on #2? (3-5 pages, citations) (Requires researching what activities in drug development are funded by particular entities.)
  4. Prepare a grant application based on the answer to #3. (length depends on grantor requirements)
  5. For extra credit, update and/or correct twenty (20) links from this site. (Check with me first; I will maintain a list of those already corrected.)

October 21, 2010

mloss.org – machine learning open source software

mloss.org – machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages, but their processes and choices are subjects as well. Not to mention their descriptions in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here, so I welcome suggestions, comments, etc. on any and all of them.

Questions:

  1. What are examples of vocabulary mismatch in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is, all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

Shogun – A Large Scale Machine Learning Toolbox

Filed under: Bioinformatics,Cheminformatics,Kernel Methods,Pattern Recognition — Patrick Durusau @ 5:08 am

Shogun – A Large Scale Machine Learning Toolbox

Not for the faint of heart, but an excellent resource for those interested in large scale kernel methods.

Offers several Support Vector Machine (SVM) implementations and implementations of the latest kernels. Has interfaces to Matlab(tm), R, Octave and Python.
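
Shogun's own API has shifted across versions, so rather than risk misquoting it, here is the same idea in scikit-learn as a stand-in (not Shogun): a kernelized SVM separating data that no linear classifier can.

```python
import numpy as np
from sklearn import svm

# XOR-style data: not linearly separable, so the kernel does the work.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 0, 1, 1])
clf = svm.SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.1]]))  # lands with the (0,1)/(1,0) class
```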

Questions:

  1. Pick any one of the methods. How would you integrate it into augmented authoring for a topic map?
  2. What aspect(s) of this site would you change using topic maps?
  3. What augmented authoring techniques would help you apply topic maps to this site?
  4. Apply topic maps to this site. (project)

October 20, 2010

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington

Abstract:

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.

In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.
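
The diffusion step is the least familiar piece for most readers. Here is a heavily simplified sketch of its flavor (my own toy, not the authors' exact GPM formulation): node labels are smoothed over graph edges so each node's label comes to reflect its neighborhood.

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph: 0 - 1 - 2
labels = np.array([1.0, 0.0, 0.0])       # initial one-hot label on node 0
P = A / A.sum(axis=1, keepdims=True)     # row-stochastic transition matrix

for _ in range(3):
    labels = 0.5 * labels + 0.5 * (P @ labels)  # lazy diffusion step
print(labels)  # node 0's label has bled toward its neighbors
```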

The authors also note:

Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.

In other words, with a desktop machine, public data and a little imagination, you can make a fundamental contribution to drug design methods. (FYI, the pharmaceutical companies are making money hand over fist.)

Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.

September 22, 2010

Journal of Cheminformatics

Filed under: Cheminformatics,Database,Subject Identity — Patrick Durusau @ 8:09 pm

Journal of Cheminformatics.

Journal of Cheminformatics is an open access, peer-reviewed, online journal encompassing all aspects of cheminformatics and molecular modeling including:

  • chemical information systems, software and databases, and molecular modelling
  • chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases
  • computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques

A good starting place for chemical subject identity issues.

September 21, 2010

ICEP – Indiana Cheminformatics Education Portal

Filed under: Cheminformatics — Patrick Durusau @ 5:19 am

ICEP – Indiana Cheminformatics Education Portal.

ICEP is a repository of freely accessible cheminformatics educational materials generated by the Indiana University Cheminformatics Program.

Learn cheminformatics, or use this as a great starting place on cheminformatics subject identity issues.
