Archive for the ‘Ontology’ Category

Does statistics have an ontology? Does it need one? (draft 2)

Tuesday, April 16th, 2013

Does statistics have an ontology? Does it need one? (draft 2) by D. Mayo.

From the post:

Chance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).* Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g.,

The post and ensuing comments offer much to consider.

From my perspective, if assumptions, ontological and otherwise, go unstated, the results are opaque.

You can accept them, because they fit your prior opinion or how you wanted the results to be, or reject them as not fitting your prior opinion or desired result.

Lazy D3 on some astronomical data

Friday, April 5th, 2013

Lazy D3 on some astronomical data by simonraper.

From the post:

I can’t claim to be anything near an expert on D3 (a JavaScript library for data visualisation) but being both greedy and lazy I wondered if I could get some nice results with minimum effort. In any case the hardest thing about D3 for a novice to the world of web design seems to be getting started at all so perhaps this post will be useful for getting people up and running.

astronomy ontology

The images above and below are visualisations using D3 of a classification hierarchy for astronomical objects provided by the IVOA (International Virtual Observatory Alliance). I take no credit for the layout. The designs are taken straight from the D3 examples gallery but I will show you how I got the environment set up and my data into the graphs. The process should be replicable for any hierarchical dataset stored in a similar fashion.

Even better than the static images are various interactive versions such as the rotating Reingold–Tilford Tree, the collapsible dendrogram and collapsible indented tree. These were all created fairly easily by substituting the astronomical object data for the data in the original examples. (I say fairly easily as you need to get the hierarchy into the right format but more on that later.)

It is easier to start with visualization of standard information structures and then move on to more exotic ones.
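
On "getting the hierarchy into the right format": the D3 tree examples read a nested structure of objects with "name" and "children" keys. Here is a minimal sketch in Python, assuming the classification arrives as flat (parent, child) pairs. The function name and the sample rows are my own illustration, not the IVOA data or the code from the post.

```python
import json


def build_hierarchy(rows, root_name):
    """Nest (parent, child) pairs into the {"name", "children"} shape."""
    nodes = {root_name: {"name": root_name, "children": []}}
    for parent, child in rows:
        parent_node = nodes.setdefault(parent, {"name": parent, "children": []})
        child_node = nodes.setdefault(child, {"name": child, "children": []})
        parent_node["children"].append(child_node)

    def prune(node):
        # Leaves carry no "children" key at all, only a name.
        if node["children"]:
            for child in node["children"]:
                prune(child)
        else:
            del node["children"]
        return node

    return prune(nodes[root_name])


# Made-up sample rows for illustration only.
rows = [("Object", "Star"), ("Object", "Galaxy"), ("Star", "Variable Star")]
tree = build_hierarchy(rows, "Object")
print(json.dumps(tree, indent=2))
```

The printed JSON can be saved to a file and loaded in place of the sample data that ships with a D3 tree example.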

What is the difference between a Taxonomy and an Ontology?

Monday, April 1st, 2013

What is the difference between a Taxonomy and an Ontology?

From the post:

In the world of information management, two common terms that people use are “taxonomy” and “ontology” but people often wonder what the difference between the two terms are. In many of our webinars, this question comes up so I wanted to provide an answer on our blog.

When I first read this post, I thought it was an April Fool’s post. But check the date: March 15, 2013. Unless April Fool’s day came early this year.

After reading the post you will find that what the author calls a taxonomy is actually an ontology.

Don’t take my word for it, see the original post.

I think the difference between a taxonomy and an ontology is that an ontology costs more.

I don’t know of any other universal differences between the two.

I first saw this in Taxonomy or Ontology by April Holmes.

Leveraging Ontologies for Better Data Integration

Thursday, February 21st, 2013

Leveraging Ontologies for Better Data Integration by David Linthicum.

From the post:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

Just to be clear: You must understand the data to define the proper integration flows and transformation scenarios, and provide service-oriented frameworks to your data integration domain, meaning levels of abstraction. This is applicable both in the movement of data from source to target systems, as well as the abstraction of the data using data virtualization approaches and technology, such as technology for the host of this blog.

This is where many data integration projects fall down. Most data integration occurs at the information level. So, you must always deal with semantics and how to describe semantics relative to a multitude of information systems. There is also a need to formalize this process, putting some additional methodology and technology behind the management of metadata, as well as the relationships therein.

Many in the world of data integration have begun to adopt the notion of ontology (or the instances of ontology: ontologies). Ontology is a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related.

Why should we care? Ontologies are important to data integration solutions because they provide a shared and common understanding of data that exists within the business domain. Moreover, ontologies illustrate how to facilitate communication between people and information systems. You can think of ontologies as the understanding of everything, and how everything should interact to reach a common objective. In this case the optimization of the business. (emphasis added)

The two bolded lines I wanted to call to your attention:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

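I wasn’t aware understanding the “meaning of data” as a prerequisite to data integration was ever contested?

Ontologies are important to data integration solutions because they provide a shared and common understanding of data that exists within the business domain.
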
I am equally unsure that having a “…common and shared understanding of data…” qualifies as an ontology.

Which is a restatement of the first point.

What interests me is how to go from non-common and non-shared understandings of data to capturing all the currently known understandings of the data?

Repeating what is uncontested or already agreed upon isn’t going to help with that task.

How Stable is Your Ontology?

Tuesday, February 19th, 2013

Assessing identity, redundancy and confounds in Gene Ontology annotations over time by Jesse Gillis and Paul Pavlidis. (Bioinformatics (2013) 29 (4): 476-482. doi: 10.1093/bioinformatics/bts727)


Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.

Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their ‘functional identity’ over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

Availability: Data available at

How does your ontology account for changes in identity over time?
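
To make "not matching to themselves" concrete, here is a rough sketch of a self-match test between two editions of an annotation set. It assumes each gene is described by a set of term identifiers and uses plain Jaccard overlap as a stand-in for semantic similarity. That is my simplification, not the measure used in the paper, and the gene and term names are invented.

```python
def jaccard(a, b):
    """Overlap of two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def self_match_rate(old, new):
    """Fraction of shared genes whose best match in `new` is themselves."""
    shared = [gene for gene in old if gene in new]
    if not shared:
        return 0.0
    hits = 0
    for gene in shared:
        # Ties are broken in favour of the gene itself.
        best = max(
            new,
            key=lambda other: (jaccard(old[gene], new[other]), other == gene),
        )
        if best == gene:
            hits += 1
    return hits / len(shared)


# Invented editions: geneB and geneC swap annotations between editions.
old = {"geneA": {"t1", "t2"}, "geneB": {"t3"}, "geneC": {"t4", "t5"}}
new = {"geneA": {"t1", "t2", "t6"}, "geneB": {"t4", "t5"}, "geneC": {"t3"}}
rate = self_match_rate(old, new)
print(f"{rate:.2f} of genes still match themselves")
```

In this toy data only geneA is still most similar to itself, so two of three genes have changed "functional identity" between editions.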

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Tuesday, February 5th, 2013

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

They’re wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why is asking users what they meant missing as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.

Any questions?

Automated compound classification using a chemical ontology

Sunday, January 13th, 2013

Automated compound classification using a chemical ontology by Claudia Bobach, Timo Böhme, Ulf Laube, Anett Püschel and Lutz Weber. (Journal of Cheminformatics 2012, 4:40 doi:10.1186/1758-2946-4-40)



Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds.


In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships.


A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemistry structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.

While creating an ontology to assist with compound classification, the authors concede the literature contains much semantic diversity:

Chemists use a variety of expressions to create compound class terms from a specific compound name – for example “backbone”, “scaffold”, “derivative”, “compound class” are often used suffixes or “substituted” is a common prefix that generates a class term. Unfortunately, the meaning of different chemical class terms is often not defined precisely and their usage may differ significantly due to historic reasons and depending on the compound class. For example, 2-ethyl-imidazole 1 belongs without doubt to the class of compounds having a imidazole scaffold, backbone or being an imidazole derivative or substituted imidazole. In contrast, pregnane 2 illustrates a more complicated case – as in case of 2-ethyl-imidazole this compound could be considered a 17-ethyl-derivative of the androstane scaffold 3. However, this would suggest a wrong compound classification as pregnanes are not considered to be androstane derivatives – although 2 contains androstane 3 as a substructure (Figure 1). This particular, structurally illogical naming convention goes back to the fundamentally different biological activities of specific compounds with a pregnane or androstane backbone, resulting in the perception that androstanes and pregnanes do not show a parent–child relation but are rather sibling concepts at the same hierarchical level. Thus, any expert chemical ontology will appreciate this knowledge and the androstane compound class structural definition needs to contain a definition that any androstane shall NOT contain a carbon substitution at the C-17 position. (emphasis added)

Not that present day researchers would create a structurally illogical naming convention in the view of future researchers.
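
The androstane rule quoted above is easy to picture as a class definition built from OR and NOT over pattern matches. The sketch below is only an illustration of that logic. It assumes each compound has already been reduced to a set of matched pattern labels, so no real SMARTS matching happens here, and the class, field and label names are all hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassRule:
    """A class definition: at least one `any_of` label, and no `none_of` label."""

    name: str
    any_of: frozenset                 # OR: one matching pattern is enough
    none_of: frozenset = frozenset()  # NOT: any match here excludes the class

    def matches(self, features):
        return bool(self.any_of & features) and not (self.none_of & features)


def classify(features, rules):
    """Return the names of every class whose rule accepts the compound."""
    return [rule.name for rule in rules if rule.matches(features)]


rules = [
    ClassRule(
        "androstane",
        any_of=frozenset({"androstane_core"}),
        none_of=frozenset({"c17_carbon_substituent"}),
    ),
    ClassRule("pregnane", any_of=frozenset({"pregnane_core"})),
]

# A pregnane contains the androstane core as a substructure,
# but the NOT clause keeps it out of the androstane class.
pregnane_like = {"androstane_core", "pregnane_core", "c17_carbon_substituent"}
androstane_like = {"androstane_core"}

print(classify(pregnane_like, rules))
print(classify(androstane_like, rules))
```

Without the NOT clause the first compound would land in both classes, which is exactly the parent–child reading the chemists reject.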

Manual Alignment of Anatomy Ontologies

Saturday, January 12th, 2013

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)


Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.

Database URL:

A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.

Ontology Alert! Molds are able to reproduce sexually

Thursday, January 10th, 2013

Unlike we thought for 100 years: Molds are able to reproduce sexually

For over 100 years, it was assumed that the penicillin-producing mould fungus Penicillium chrysogenum only reproduced asexually through spores. An international research team led by Prof. Dr. Ulrich Kück and Julia Böhm from the Chair of General and Molecular Botany at the Ruhr-Universität has now shown for the first time that the fungus also has a sexual cycle, i.e. two “genders”. Through sexual reproduction of P. chrysogenum, the researchers generated fungal strains with new biotechnologically relevant properties – such as high penicillin production without the contaminating chrysogenin. The team from Bochum, Göttingen, Nottingham (England), Kundl (Austria) and Sandoz GmbH reports in PNAS. The article will be published in this week’s Online Early Edition and was selected as a cover story.

J. Böhm, B. Hoff, C.M. O’Gorman, S. Wolfers, V. Klix, D. Binger, I. Zadra, H. Kürnsteiner, S. Pöggeler, P.S. Dyer, U. Kück (2013): Sexual reproduction and mating-type – mediated strain development in the penicillin-producing fungus Penicillium chrysogenum, PNAS, DOI: 10.1073/pnas.1217943110

If you have hard-coded asexual reproduction into your ontology, time to reconsider that decision. And get agreement on reworking all the dependent relationships.

Semantically enabling a genome-wide association study database

Saturday, January 5th, 2013

Semantically enabling a genome-wide association study database by Tim Beck, Robert C Free, Gudmundur A Thorisson and Anthony J Brookes. Journal of Biomedical Semantics 2012, 3:9 doi:10.1186/2041-1480-3-9.



The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central — a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.


A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.


We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.

Rather than:

The benefits of employing ontologies for standardising and structuring data are widely accepted.

I would rephrase that to read:

The benefits and limitations of employing ontologies for standardising and structuring data are widely known.

Decades of use of relational database schemas, informal equivalents of ontologies, leave no doubt governing structures for data have benefits.

Less often acknowledged is that those same governing structures impose limitations on data and what may be represented.

That’s not a dig at relational databases.

Just an observation that ontologies and their equivalents aren’t unalloyed precious metals.

Standard Upper Merged Ontology (SUMO), One of the “Less Fortunate” at Christmas Time.

Sunday, December 23rd, 2012

At this happy time of the year you should give some thought to the “less fortunate,” such as the Standard Upper Merged Ontology (SUMO).

Elementary school physics teaches four (4) states of matter: solid, liquid, gas, plasma, which SUMO enshrines as:

(subclass PhysicalState InternalAttribute)
(contraryAttribute Solid Liquid Gas Plasma)
(exhaustiveAttribute PhysicalState Solid Fluid Liquid Gas Plasma)
(documentation PhysicalState EnglishLanguage "The physical state of an &%Object. There
are three reified instances of this &%Class: &%Solid, &%Liquid, and &%Gas.
Physical changes are not characterized by the transformation of one
substance into another, but rather by the change of the form (physical
states) of a given substance. For example, melting an iron nail yields a
substance still called iron.")

Best just to say it: there are over 500 phases of matter. A new method for classifying the states of matter offers insight into the design of superconductors and quantum computers.

SUMO is still “valid” in the sense Newtonian physics are still “valid,” provided your instruments or requirements are crude enough.

Use of these new states in research and engineering is underway, making indexing and retrieval active concerns.

Should we ask researchers to withhold publications until SUMO and other ontology based systems have time to catch up?

Other alternatives?

I first saw this in: The 500 Phases of Matter: New System Successfully Classifies Symmetry-Protected Phases (Science Daily).

See also:

X. Chen, Z.-C. Gu, Z.-X. Liu, X.-G. Wen. Symmetry-Protected Topological Orders in Interacting Bosonic Systems. Science, 2012; 338 (6114): 1604 DOI: 10.1126/science.1227224

Transformation versus Addition (How Ontologies Differ from Topic Maps)

Monday, November 12th, 2012

While reading An Ontological Representation of Biomedical Data Sources and Records by Michael Bada, Kevin Livingston, and Lawrence Hunter, I realized an essential difference between ontologies and topic maps.

Bada and colleagues developed:

an OWL-based model for the representation of these database records as an intermediate solution for the integration of these data in RDF stores.

That is to say they transformed the original records into a representation in OWL.

Which then allowed them to query consistently across the records, due to the transformation into a new, uniform representation.

Contrast that to topic maps, which offer an additive solution.

Topic maps enable the creation of an entity and the addition to that entity of the equivalent identifications from all 17 databases.

Any other databases that become of interest can be added to the topic map in the same way.

Another way to say the difference is that ontologies set forth “a” way to make any statement, whereas topic maps collect multiple ways to say the same thing.
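
A toy contrast may help. The sketch below assumes two sources that identify the same gene under different field names. All record contents, field names and function names are invented for illustration, and matching on a shared name is a simplification of how a topic map would decide that two identifications belong to the same subject.

```python
from collections import defaultdict

# Hypothetical records: two sources describing the same subject differently.
source_a = [{"symbol": "GENE1", "a_id": "A-001"}]
source_b = [{"gene_name": "GENE1", "b_acc": "B-9"}]


def transform(record, key_field, id_field, source):
    """Transformation: rewrite a record into one uniform representation."""
    return {"gene": record[key_field], "id": record[id_field], "source": source}


def add_identifications(entities, records, key_field, id_field, source):
    """Addition: attach a source's own identifier to the shared entity."""
    for record in records:
        entities[record[key_field]].add((source, record[id_field]))
    return entities


# Transformation: every record is recast in "a" single way of saying it.
uniform = [transform(r, "symbol", "a_id", "A") for r in source_a]
uniform += [transform(r, "gene_name", "b_acc", "B") for r in source_b]

# Addition: one entity collects the different ways of saying the same thing.
entities = defaultdict(set)
add_identifications(entities, source_a, "symbol", "a_id", "A")
add_identifications(entities, source_b, "gene_name", "b_acc", "B")

print(uniform)
print(dict(entities))
```

Adding a third source to the additive version is one more call to `add_identifications`; nothing already collected has to be rewritten.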

Which solution works best for you will depend on your requirements, existing efforts in your field, data that you wish to use, etc.

None of those considerations involves the software a vendor is selling, what devotees advocate, or the like.

Any solution should fit your needs or you should simply walk away.

An Ontological Representation of Biomedical Data Sources and Records [Data & Record as Subjects]

Monday, November 12th, 2012

An Ontological Representation of Biomedical Data Sources and Records by Michael Bada, Kevin Livingston, and Lawrence Hunter.


Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

Recognition of the need to treat data containers as subjects, along with the data they contain, is always refreshing.

In particular because the evolution of data sources can be captured, as the authors remark:

Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.
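
The "new instance per change" approach described above can be pictured with a few plain Python classes. This is only a loose sketch of the idea, not the authors' OWL model. Every class, attribute and sample name below is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Schema:
    fields: tuple  # field names of one data set, in order


@dataclass(frozen=True)
class DataSet:
    name: str
    schema: Schema


@dataclass
class DataSourceVersion:
    label: str
    data_sets: list = field(default_factory=list)


@dataclass
class DataSource:
    name: str
    versions: list = field(default_factory=list)

    def add_version(self, label, data_sets):
        """Record a change by creating new instances, never editing old ones."""
        version = DataSourceVersion(label, list(data_sets))
        self.versions.append(version)
        return version


source = DataSource("ExampleDB")
v1 = source.add_version("v1", [DataSet("genes", Schema(("id", "symbol")))])
# The schema gains a field: a new schema instance, the old one is untouched.
v2 = source.add_version("v2", [DataSet("genes", Schema(("id", "symbol", "taxon")))])

for version in source.versions:
    print(version.label, [ds.schema.fields for ds in version.data_sets])
```

Both versions stay queryable side by side, which is the point of treating the data source and its records as subjects in their own right.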

I first saw this in a tweet by Anita de Waard.

An Ontology of Reasoning, Certainty and Attribution (ORCA)

Monday, November 12th, 2012

An Ontology of Reasoning, Certainty and Attribution (ORCA) by Anita de Waard and Jodi Schneider.

Anita’s slides for her presentation tomorrow at ISWC2012.

Interesting slides that conclude with copious references to the literature.

I don’t doubt that “hedging” can be detected, but am less certain about the granularity allowed by the presented model.

Still, interesting research and merits your attention.

orca, the Ontology of Reasoning, Certainty and Attribution

Friday, November 9th, 2012

orca, the Ontology of Reasoning, Certainty and Attribution by Anita de Waard and Jodi Schneider (DERI).


orca, the Ontology of Reasoning, Certainty and Attribution, is an ontology for characterizing the certainty of information, how it is known, and its source

I am not sure of the utility of hyperlinks to identify authors when they are not publicly accessible (as is the case here, 9 November 2012).

Being curious about the usage of “orca:directlyLessCertainThan” I searched for its usage, finding only the vocabulary page.

Ditto for: “”

Research on the usage of terms in ontological vocabularies, and the communities that use them, say in the Common Crawl dataset, could be quite useful.

Semantics, after all, are determined by common usage, not decree.

Semantic Technologies — Biomedical Informatics — Individualized Medicine

Friday, November 9th, 2012

Joint Workshop on Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine (SATBI+SWIM 2012) (In conjunction with International Semantic Web Conference (ISWC 2012) Boston, Massachusetts, U.S.A. November 11-15, 2012)

If you are at ISWC, consider attending.

To help with that choice, the accepted papers:

Jim McCusker, Jeongmin Lee, Chavon Thomas and Deborah L. McGuinness. Public Health Surveillance Using Global Health Explorer. [PDF]

Anita de Waard and Jodi Schneider. Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). [PDF]

Alexander Baranya, Luis Landaeta, Alexandra La Cruz and Maria-Esther Vidal. A Workflow for Improving Medical Visualization of Semantically Annotated CT-Images. [PDF]

Derek Corrigan, Jean Karl Soler and Brendan Delaney. Development of an Ontological Model of Evidence for TRANSFoRm Utilizing Transition Project Data. [PDF]

Amina Chniti, Abdelali BOUSSADI, Patrice DEGOULET, Patrick Albert and Jean Charlet. Pharmaceutical Validation of Medication Orders Using an OWL Ontology and Business Rules. [PDF]

Manual Gene Ontology annotation workflow

Sunday, November 4th, 2012

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database by Harold J. Drabkin, Judith A. Blake and for the Mouse Genome Informatics Database. Database (2012) 2012 : bas045 doi: 10.1093/database/bas045.


The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource ( The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
All data associations are supported with statements of evidence as well as access to source publications.

Semantic uniformity is achievable, in a limited enough sphere, provided you are willing to pay the price for it.

It has a high rate of return over less carefully curated content.

The project is producing high quality results, although hampered by a lack of resources.

My question is whether a similar high quality of results could be achieved with less semantically consistent curation by distributed contributors?

Harnessing the community of those interested in such a resource. And refining those less semantically consistent entries into higher quality annotations.

Pointers to examples of such projects?

The “O” Word (Ontology) Isn’t Enough

Tuesday, October 16th, 2012

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates. (see the article below)

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word (ontology), and promising that it will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.


While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.

The Units Ontology: a tool for integrating units of measurement in science

Sunday, October 14th, 2012

The Units Ontology: a tool for integrating units of measurement in science by Georgios V. Gkoutos, Paul N. Schofield, and Robert Hoehndorf. (Database (2012) 2012: bas033, doi: 10.1093/database/bas033)


Units are basic scientific tools that render meaning to numerical data. Their standardization and formalization caters for the report, exchange, process, reproducibility and integration of quantitative measurements. Ontologies are means that facilitate the integration of data and knowledge allowing interoperability and semantic information processing between diverse biomedical resources and domains. Here, we present the Units Ontology (UO), an ontology currently being used in many scientific resources for the standardized description of units of measurements.

As the paper acknowledges, there are many measurement systems in use today.

Leaves me puzzled as to what happens to data that follows some other drummer, one other than this ontology.

I assume any coherent system has no difficulty integrating data written in that system.

So how does adding another coherent system assist in that integration?

Unless everyone universally moves to the new system. Unlikely, don’t you think?

The 2012 ACM Computing Classification System toc

Friday, September 21st, 2012

The 2012 ACM Computing Classification System toc

From the post:

The 2012 ACM Computing Classification System has been developed as a poly-hierarchical ontology that can be utilized in semantic web applications. It replaces the traditional 1998 version of the ACM Computing Classification System (CCS), which has served as the de facto standard classification system for the computing field. It is being integrated into the search capabilities and visual topic displays of the ACM Digital Library. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline and is receptive to structural change as it evolves in the future. ACM will provide tools to facilitate the application of 2012 CCS categories to forthcoming papers and a process to ensure that the CCS stays current and relevant. The new classification system will play a key role in the development of a people search interface in the ACM Digital Library to supplement its current traditional bibliographic search.

The full CCS classification tree is freely available for educational and research purposes in these downloadable formats: SKOS (xml), Word, and HTML. In the ACM Digital Library, the CCS is presented in a visual display format that facilitates navigation and feedback.

Will be looking at how the classification has changed since 1998. And since we have so much data online, it should not be all that hard to see how well the 1998 categories work for 1988, or 1977.

All for a classification that is “current and relevant.”

Still, don’t want papers dropping off the edge of the semantic world due to changes in classification.
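One way to start looking at how categories change between versions is to compare the preferred labels in two SKOS files. What follows is only a minimal sketch: it assumes both versions are available as SKOS in RDF/XML with `skos:Concept` elements carrying `skos:prefLabel` children, and the two tiny documents and their "Topic" labels are made up for illustration. They are not taken from the ACM files, whose actual layout may differ.

```python
import xml.etree.ElementTree as ET

# Standard SKOS namespace; the sample documents below are invented.
SKOS = "http://www.w3.org/2004/02/skos/core#"

OLD_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="urn:example:a"><skos:prefLabel>Topic A</skos:prefLabel></skos:Concept>
  <skos:Concept rdf:about="urn:example:b"><skos:prefLabel>Topic B</skos:prefLabel></skos:Concept>
</rdf:RDF>"""

NEW_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="urn:example:b"><skos:prefLabel>Topic B</skos:prefLabel></skos:Concept>
  <skos:Concept rdf:about="urn:example:c"><skos:prefLabel>Topic C</skos:prefLabel></skos:Concept>
</rdf:RDF>"""


def pref_labels(skos_xml):
    """Collect the skos:prefLabel text of every skos:Concept in an RDF/XML string."""
    root = ET.fromstring(skos_xml)
    labels = set()
    for concept in root.iter("{%s}Concept" % SKOS):
        for label in concept.findall("{%s}prefLabel" % SKOS):
            if label.text:
                labels.add(label.text.strip())
    return labels


def compare(old_xml, new_xml):
    """Report which labels were dropped, added, or kept between two versions."""
    old, new = pref_labels(old_xml), pref_labels(new_xml)
    return {
        "dropped": sorted(old - new),
        "added": sorted(new - old),
        "kept": sorted(old & new),
    }


if __name__ == "__main__":
    print(compare(OLD_XML, NEW_XML))
```

Comparing labels only catches renames and removals. A paper filed under a "dropped" label still needs a mapping to wherever its subject went, which is the part a label diff cannot supply.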

Legal Rules, Text and Ontologies Over Time [The eternal “now?”]

Monday, September 3rd, 2012

Legal Rules, Text and Ontologies Over Time by Monica Palmirani, Tommaso Ognibene and Luca Cervone.


The current paper presents the “Fill the gap” project that aims to design a set of XML standards for modelling legal documents in the Semantic Web over time. The goal of the project is to design an information system using XML standards able to store in an XML-native database legal resources and legal rules in an integrated way for supporting legal knowledge engineers and end-users (e.g., public administrative officers, judges, citizens).

It was refreshing to read:

The law changes over time and consequently change the rules and the ontological classes (e.g., the definition of EU citizenship changed in 2004 with the annexation of 10 new member states in the European Community). It is also fundamental to assign dates to the ontology and to the rules, based on an analytical approach, to the text, and analyze the relationships among sets of dates. The semantic web cake recommends that content, metadata should be modelled and represented in separate and clean layers. This recommendation is not widely followed from too many XML schemas, including those in the legal domain. The layers of content and rules are often confused to pursue a short annotation syntax, or procedural performance parameters or simply because a neat analysis of the semantic and abstract components is missing.

Not being mindful of time, of the effective date of changes to laws, the dates of events/transactions, can be hazardous to your pocketbook and/or your freedom!

Does your topic map account for time, or does it exist in an eternal “now,” like the WWW?
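To make "accounting for time" concrete, here is a minimal sketch of definitions that carry their own effective dates, so a lookup has to say which date it is asking about. It is not the "Fill the gap" project's XML model. The class and function names are hypothetical, and the definition texts are simplified placeholders built around the 1 May 2004 enlargement mentioned in the quote.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatedDefinition:
    """A definition that is only in force for a stated interval."""
    term: str
    text: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still in force"

    def in_force(self, on):
        # Half-open interval: valid_from is included, valid_to is excluded.
        return self.valid_from <= on and (self.valid_to is None or on < self.valid_to)


def definition_on(definitions, term, on):
    """Return the single definition of `term` in force on date `on`, or None."""
    matches = [d for d in definitions if d.term == term and d.in_force(on)]
    if len(matches) > 1:
        raise ValueError("overlapping definitions for %r on %s" % (term, on))
    return matches[0] if matches else None


# Placeholder texts; only the boundary dates are meant literally.
DEFINITIONS = [
    DatedDefinition("EU citizenship", "national of one of 15 member states",
                    date(1995, 1, 1), date(2004, 5, 1)),
    DatedDefinition("EU citizenship", "national of one of 25 member states",
                    date(2004, 5, 1), date(2007, 1, 1)),
]

if __name__ == "__main__":
    for day in (date(2003, 6, 1), date(2004, 5, 1)):
        print(day, "->", definition_on(DEFINITIONS, "EU citizenship", day).text)
```

The design choice worth noticing is that there is no "current" definition at all. Every query names a date, which is the opposite of the eternal "now."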

I first saw this at Legal Informatics.

Community Based Annotation (mapping?)

Thursday, August 2nd, 2012

Enabling authors to annotate their articles is examined in: Assessment of community-submitted ontology annotations from a novel database-journal partnership by Tanya Z. Berardini, Donghui Li, Robert Muller, Raymond Chetty, Larry Ploetz, Shanker Singh, April Wensel and Eva Huala.


As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles’ contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed.

We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.

It is encouraging that this annotation effort started with the persons most likely to know the correct answers, authors of the papers in question.

The low initial participation rate (16%), and the improved rate after an email reminder (53%), were less encouraging.

I suspect that unless and until prior annotation practices (by researchers) become a line item on current funding requests (how many annotations were accepted by publishers of your prior research?), we will continue to see annotation treated as a low priority item.

Perhaps I should suggest that as a study area for the NIH?

Publishers, researchers who build annotation software, annotated data sources and their maintainers, are all likely to be interested.

Would you be interested as well?

The Ontology for Biomedical Investigations (OBI)

Sunday, July 15th, 2012

The Ontology for Biomedical Investigations (OBI)

From the webpage:

The Ontology for Biomedical Investigations (OBI) project is developing an integrated ontology for the description of biological and clinical investigations. This includes a set of ‘universal’ terms, that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will represent the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type analysis performed on it. Currently OBI is being built under the Basic Formal Ontology (BFO).

  • Develop an Ontology for Biomedical Investigations in collaboration with groups representing different biological and technological domains involved in Biomedical Investigations
  • Make OBI compatible with other bio-ontologies
  • Develop OBI using an open source approach
  • Create a valuable resource for the biomedical communities to provide a source of terms for consistent annotation of investigations

An ontology that will be of interest if you are integrating biomedical materials.

At least as a starting point.

My listings of ontologies, vocabularies, etc., are woefully incomplete for any field and represent at best starting points for your own, more comprehensive investigations. If you do find these starting points useful, please send pointers to your more complete investigations for any field.

An XML-Format for Conjectures in Geometry (Work-in-Progress)

Saturday, July 14th, 2012

An XML-Format for Conjectures in Geometry (Work-in-Progress) by Pedro Quaresma.


With a large number of software tools dedicated to the visualisation and/or demonstration of properties of geometric constructions and also with the emerging of repositories of geometric constructions, there is a strong need of linking them, and making them and their corpora, widely usable. A common setting for interoperable interactive geometry was already proposed, the i2g format, but, in this format, the conjectures and proofs counterparts are missing. A common format capable of linking all the tools in the field of geometry is missing. In this paper an extension of the i2g format is proposed, this extension is capable of describing not only the geometric constructions but also the geometric conjectures. The integration of this format into the Web-based GeoThms, TGTP and Web Geometry Laboratory systems is also discussed.

The author notes these open questions:

  • The xml format must be complemented with an extensive set of converters allowing the exchange of information between as many geometric tools as possible.
  • The databases queries, as in TGTP, raise the question of selecting appropriate keywords. A fine grain index and/or an appropriate geometry ontology should be addressed.
  • The i2gatp format does not address proofs. Should we try to create such a format? The GATPs produce proofs in quite different formats, maybe the construction of such unifying format it is not possible and/or desirable in this area.

The “keywords,” “fine grained index,” “geometry ontology,” question yells “topic map” to me.


PS: Converters and different formats also say “topic map,” just not as loudly to me. Your volume may vary. (YVMV)


Broccoli: Semantic Full-Text Search at your Fingertips

Friday, July 13th, 2012

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, and Elmar Haussmann.


We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g. edible), classes (e.g. plants), instances (e.g. Broccoli), and relations (e.g. occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully-functional prototype based on our ideas, see this http URL

It’s good to see CS projects work so hard to find unambiguous names, ones that won’t be confused with far more common uses of the same names. 😉

For all that, on quick review it does look like a clever, if annoyingly named, project.

Hmmm, doesn’t like the “-” (hyphen) character. “graph-theoretical tree” returns 0 results, “graph theoretical tree” returns 1 (the expected one).

Definitely worth a close read.

One puzzle though. There are a number of projects that use Wikipedia data dumps. The problem is most of the documents I am interested in searching aren’t in Wikipedia data dumps. Like the Enron emails.

Techniques that work well with clean data may work less well with documents composed of the vagaries of human communication. Or attempts at communication.
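For readers who want to picture the queries the abstract describes (trees whose nodes are bags of words, classes, and instances, and whose arcs are relations), here is a minimal data-structure sketch. It is not Broccoli's implementation or query syntax. `QueryNode`, `render`, and the sample query are my own hypothetical names, and treating "Europe" as an instance is an assumption made only for the example.

```python
from dataclasses import dataclass, field

KINDS = {"word", "class", "instance"}


@dataclass
class QueryNode:
    """A node holds a bag of (kind, value) items; arcs are (relation, child) pairs."""
    items: list
    arcs: list = field(default_factory=list)

    def __post_init__(self):
        for kind, _value in self.items:
            if kind not in KINDS:
                raise ValueError("unknown item kind: %r" % kind)

    def add(self, relation, child):
        """Attach a child node through a named relation and return self for chaining."""
        self.arcs.append((relation, child))
        return self


def render(node, depth=0):
    """Return the tree as a list of indented lines, one node or arc per line."""
    pad = "  " * depth
    bag = ", ".join("%s:%s" % (kind, value) for kind, value in node.items)
    lines = ["%s[%s]" % (pad, bag)]
    for relation, child in node.arcs:
        lines.append("%s  --%s-->" % (pad, relation))
        lines.extend(render(child, depth + 2))
    return lines


if __name__ == "__main__":
    # Hypothetical query: plants that occur with the word "edible" and are native to Europe.
    query = (QueryNode([("class", "plants")])
             .add("occurs-with", QueryNode([("word", "edible")]))
             .add("native-to", QueryNode([("instance", "Europe")])))
    print("\n".join(render(query)))
```

A node with only words and no arcs is an ordinary full-text query, and a tree with no words is a pure ontology query, which matches the abstract's remark that both are special cases.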

Semantator: annotating clinical narratives with semantic web ontologies

Thursday, July 12th, 2012

Semantator: annotating clinical narratives with semantic web ontologies by Dezhao Song, Christopher G. Chute, and Cui Tao. (AMIA Summits Transl Sci Proc. 2012;2012:20-9. Epub 2012 Mar 19.)


To facilitate clinical research, clinical data needs to be stored in a machine processable and understandable way. Manual annotating clinical data is time consuming. Automatic approaches (e.g., Natural Language Processing systems) have been adopted to convert such data into structured formats; however, the quality of such automatically extracted data may not always be satisfying. In this paper, we propose Semantator, a semi-automatic tool for document annotation with Semantic Web ontologies. With a loaded free text document and an ontology, Semantator supports the creation/deletion of ontology instances for any document fragment, linking/disconnecting instances with the properties in the ontology, and also enables automatic annotation by connecting to the NCBO annotator and cTAKES. By representing annotations in Semantic Web standards, Semantator supports reasoning based upon the underlying semantics of the owl:disjointWith and owl:equivalentClass predicates. We present discussions based on user experiences of using Semantator.

If you are an AMIA member, see above for the paper. If not, see: Semantator: annotating clinical narratives with semantic web ontologies (PDF file). And the software/webpage: Semantator.

Software is a plugin for Protege 4.1 or higher.

Looking at the extensive screen shots at the website, which has good documentation, the first question I would ask a potential user is: “Are you comfortable with Protege?” If they aren’t, I suspect you are going to invest a lot of time in teaching them ontologies and Protege. Just an FYI.

Complex authoring tools, particularly for newbies, seem like a non-starter to me. For example, why not have a standalone entity extractor (but don’t call it that, call it “I See You” (ISY)) that uses a preloaded entity file to recognize entities in a text? Where there is uncertainty, those entities are displayed in a different color, with drop-down options listing other possible entities. Users get to pick one from the list (no write-in ballots). That performs a step towards getting clean data for a second round with another one-trick-pony tool. User contributes, we all benefit.
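A minimal sketch of what such a one-trick pony might look like, under my own assumptions: the "entity file" is just a dictionary from lowercase surface forms to candidate entity identifiers, a term with more than one candidate counts as uncertain, and the identifiers are invented. No interface is modelled here. The `uncertain` flag is simply what a front end would use to pick the color.

```python
import re

# Hypothetical preloaded entity file: surface form -> candidate entity identifiers.
ENTITIES = {
    "aspirin": ["drug:aspirin"],
    "mercury": ["element:Hg", "planet:Mercury"],
}


def recognize(text, entities=ENTITIES):
    """Find known surface forms in `text` and flag the ones with several candidates."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        candidates = entities.get(match.group().lower())
        if candidates:
            hits.append({
                "span": match.span(),
                "text": match.group(),
                "candidates": list(candidates),
                "uncertain": len(candidates) > 1,  # a UI would color these differently
            })
    return hits


def choose(hit, picked):
    """Record the user's pick; only listed candidates are allowed (no write-ins)."""
    if picked not in hit["candidates"]:
        raise ValueError("pick one of: %s" % ", ".join(hit["candidates"]))
    resolved = dict(hit)
    resolved["entity"] = picked
    resolved["uncertain"] = False
    return resolved


if __name__ == "__main__":
    for hit in recognize("Aspirin and mercury were noted."):
        print(hit)
```

Restricting `choose` to the listed candidates is the whole point of the exercise: the output of this round stays clean enough to feed the next single-purpose tool.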

Which brings me to the common shortfall of annotation solutions: the requirement that the text to be annotated be in plain text.

There are a lot of “text” documents, but what of those in Word, PDF, Postscript, PPT, Excel, to say nothing of other formats?

The past will not disappear for want of a robust annotation solution.

Nor should it.

Knowledge Design Patterns

Saturday, June 16th, 2012

Knowledge Design Patterns

John Sowa announced these slides as:

Last week, I presented a 3-hour tutorial on Knowledge Design Patterns at the Semantic Technology Conference in San Francisco. Following are the slides:

The talk was presented on June 4, but these are the June 10th version of the slides. They include a few revisions and extensions, which I added to clarify some of the issues and to answer some of the questions that were asked during the presentation.

And John posted an outline of the 130 slides:

Outline of This Tutorial

1. What are knowledge design patterns?
2. Foundations of ontology.
3. Syllogisms, categorical and hypothetical.
4. Patterns of logic.
5. Combining logic and ontology.
6. Patterns of patterns of patterns.
7. Simplifying the user interface.

Particularly if you have never seen a Sowa presentation, take a look at the slides.

Capturing…Quantitative and Semantic Information in Radiology Images

Tuesday, June 5th, 2012

Daniel Rubin from Stanford University on “Capturing and Computer Reasoning with Quantitative and Semantic Information in Radiology Images” at 10:00am PT, Wednesday, June 6.


The use of semantic Web technologies to make the myriad of data in cyberspace accessible to intelligent agents is well established. However, a crucial type of information on the Web–and especially in life sciences–is imaging, which is largely being overlooked in current semantic Web endeavors. We are developing methods and tools to enable the transparent discovery and use of large distributed collections of medical images within hospital information systems and ultimately on the Web. Our approach is to make the human and machine descriptions of image content machine-accessible through “semantic annotation” using ontologies, capturing semantic and quantitative information from images as physicians view them in a manner that minimally affects their current workflow. We exploit new standards for making image contents explicit and publishable on the semantic Web. We will describe tools and methods we are developing and preliminary results using them for response assessment in cancer. While this work is focused on images in the life sciences, it has broader applicability to all images on the Web. Our ultimate goal is to enable semantic integration of images and all the related scientific data pertaining to their content so that physicians and basic scientists can have the best understanding of the biological and physiological significance of image content.


Daniel L. Rubin, MD, MS is Assistant Professor of Radiology and Medicine (Biomedical Informatics Research) at Stanford University. He is a Member of the Stanford Cancer Center and the Bio-X interdisciplinary research program. His NIH-funded research program focuses on the intersection of biomedical informatics and imaging science, developing computational methods and applications to extract quantitative information and meaning from clinical, molecular, and imaging data, and to translate these methods into practice through applications to improve diagnostic accuracy and clinical effectiveness. He is Principal Investigator of one of the centers in the National Cancer Institute’s recently-established Quantitative Imaging Network (QIN), Chair of the RadLex Steering Committee of the Radiological Society of North America (RSNA), and Chair of the Informatics Committee of the American College of Radiology Imaging Network (ACRIN). Dr. Rubin has published over 100 scientific publications in biomedical imaging informatics and radiology.

To start or join the online meeting
Go to

Audio conference information
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers:

Access code: 925 343 903

Whether you are using topic maps for image annotation or mapping between systems of image annotation, this promises to be an interesting presentation.

Wyner and Hoekstra on A Legal Case OWL Ontology with an Instantiation of Popov v. Hayashi

Friday, March 16th, 2012

Wyner and Hoekstra on A Legal Case OWL Ontology with an Instantiation of Popov v. Hayashi

From Legalinformatics:

Dr. Adam Wyner of the University of Leeds Centre for Digital Citizenship and Dr. Rinke Hoekstra of the University of Amsterdam’s Leibniz Center for Law have published A legal case OWL ontology with an instantiation of Popov v. Hayashi, forthcoming in Artificial Intelligence and Law. Here is the abstract:

The legal case ontology here.

I have a history with logic and the law that stretches over decades. Rather than comment now, I am interested in what you think. What is strong/weak about this proposal?