Archive for the ‘Ontology’ Category

YAGO: A High-Quality Knowledge Base (Open Source)

Saturday, October 28th, 2017

YAGO: A High-Quality Knowledge Base

Overview:

YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.

YAGO is special in several ways:

  1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.
  2. YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
  3. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
  4. In addition to a taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.
  5. YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

YAGO is developed jointly with the DBWeb group at Télécom ParisTech University.

Before you are too impressed by the numbers, which are impressive, realize that 10 million entities is 3% of the current US population. To say nothing of any other entities we might want to include along with them. It’s a good start and very useful, but realize it is a limited set of entities.

All the source data is available, along with the source code.

Would be interesting to see how useful the entity set is when used with US campaign contribution data.

Thoughts?

Spontaneous Preference for their Own Theories (SPOT effect) [SPOC?]

Thursday, February 4th, 2016

The SPOT Effect: People Spontaneously Prefer their Own Theories by Aiden P. Gregg, Nikhila Mahadevan, and Constantine Sedikides.

Abstract:

People often exhibit confirmation bias: they process information bearing on the truth of their theories in a way that facilitates their continuing to regard those theories as true. Here, we tested whether confirmation bias would emerge even under the most minimal of conditions. Specifically, we tested whether drawing a nominal link between the self and a theory would suffice to bias people towards regarding that theory as true. If, all else equal, people regard the self as good (i.e., engage in self-enhancement), and good theories are true (in accord with their intended function), then people should regard their own theories as true; otherwise put, they should manifest a Spontaneous Preference for their Own Theories (i.e., a SPOT effect). In three experiments, participants were introduced to a theory about which of two imaginary alien species preyed upon the other. Participants then considered in turn several items of evidence bearing on the theory, and each time evaluated the likelihood that the theory was true versus false. As hypothesized, participants regarded the theory as more likely to be true when it was arbitrarily ascribed to them as opposed to an “Alex” (Experiment 1) or to no one (Experiment 2). We also found that the SPOT effect failed to converge with four different indices of self-enhancement (Experiment 3), suggesting it may be distinctive in character.

I can’t give you the details on this article because it is paywalled.

But the catch phrase, “Spontaneous Preference for their Own Theories (i.e., a SPOT effect)” certainly fits every discussion of semantics I have ever read or heard.

With a little funding you could prove the corollary, Spontaneous Preference for their Own Code (the SPOC effect) among programmers. 😉

There are any number of formulations for how to fight confirmation bias but Jeremy Dean puts it this way:


The way to fight the confirmation bias is simple to state but hard to put into practice.

You have to try and think up and test out alternative hypothesis. Sounds easy, but it’s not in our nature. It’s no fun thinking about why we might be misguided or have been misinformed. It takes a bit of effort.

It’s distasteful reading a book which challenges our political beliefs, or considering criticisms of our favourite film or, even, accepting how different people choose to live their lives.

Trying to be just a little bit more open is part of the challenge that the confirmation bias sets us. Can we entertain those doubts for just a little longer? Can we even let the facts sway us and perform that most fantastical of feats: changing our minds?

I wonder if that includes imagining using JSON? (shudder) 😉

Hard to do, particularly when we are talking about semantics and what we “know” to be the best practices.

Examples of trying to escape the confirmation bias trap and the results?

Perhaps we can encourage each other.

Street-Fighting Mathematics – Free Book – Lesson For Semanticists?

Friday, January 1st, 2016

Street-Fighting Mathematics: The Art of Educated Guessing and Opportunistic Problem Solving by Sanjoy Mahajan.

From the webpage:

In problem solving, as in street fighting, rules are for fools: do whatever works—don’t just stand there! Yet we often fear an unjustified leap even though it may land us on a correct result. Traditional mathematics teaching is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions. This engaging book is an antidote to the rigor mortis brought on by too much mathematical rigor, teaching us how to guess answers without needing a proof or an exact calculation.

In Street-Fighting Mathematics, Sanjoy Mahajan builds, sharpens, and demonstrates tools for educated guessing and down-and-dirty, opportunistic problem solving across diverse fields of knowledge—from mathematics to management. Mahajan describes six tools: dimensional analysis, easy cases, lumping, picture proofs, successive approximation, and reasoning by analogy. Illustrating each tool with numerous examples, he carefully separates the tool—the general principle—from the particular application so that the reader can most easily grasp the tool itself to use on problems of particular interest. Street-Fighting Mathematics grew out of a short course taught by the author at MIT for students ranging from first-year undergraduates to graduate students ready for careers in physics, mathematics, management, electrical engineering, computer science, and biology. They benefited from an approach that avoided rigor and taught them how to use mathematics to solve real problems.

I have just started reading Street-Fighting Mathematics but I wonder if there is a parallel between mathematics and the semantics that everyone talks about capturing from information systems.

Consider this line:

Traditional mathematics teaching is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions.

And re-cast it for semantics:

Traditional semantics (Peirce, FOL, SUMO, RDF) is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions.

What if the semantics we capture and apply are sufficient for your use case? Complete with ROI for that use case.

Is that sufficient?

Santa Claus is Real

Friday, December 25th, 2015

Santa Claus is Real by Johnathan Korman.

I won’t try to summarize Korman’s post but will quote a snippet to entice you to read it in full:


Santa Claus is as real as I am.

Santa is, in truth, more real than I am. He has a bigger effect on the world.

After all, how many people know Santa Claus? If I walk down Market Street in San Francisco, there’s a good chance that a few people will recognize me; I happen to be a distinctive-looking guy. There’s a chance that one or two of those people will even know my name and a few things about me, but the odds are greatly against it. But if Santa takes the same walk, everybody (or nearly everybody) will recognize him, know his name, know a number of things about him, even have personal stories about him. So who is more real?

Enjoy!

UK – Investigatory Powers Bill – Volunteer Targets

Friday, November 27th, 2015

I saw a tweet earlier today that indicates the drafters of the UK Investigatory Powers Bill have fouled themselves, again.

Section 195, General Definitions (1) has a list of unnumbered definitions which includes:

“data” includes any information which is not data,

However creative the English courts may be, I think that passage is going to prove to be a real challenge.

Which makes me even more worried than I was before.

A cleanly drafted bill that strips every citizen of the UK of their rights presents a well-defined target for opposition.

In this semantic morass, terms could mean what they say, the opposite and also be slang for a means of execution.

Because of the Paris bombings, there is a push on to approve something, anything, to be seen as taking steps against terrorism.

Instead of the Investigatory Powers Bill, Parliament should acquire 5 acres of land outside of London and erect a podium at its center. Members of Parliament will take turns reading Shakespeare aloud for two hours, eight hours a day, every day of the year.

Terrorists prefer high-value targets over low and so members of Parliament can save all the people of the UK from fearing terrorist attacks.

Their presence as targets will attract terrorists and simplify the task of locating potential terrorists.

Any member of parliament who is killed while reading Shakespeare at the designated location, should be posthumously made a peer of the realm.

A bill like that would protect the rights of every citizen of the UK, assist in the hunting of terrorists by drawing them to a common location and help prevent future crimes against the English language as are found in the Investigatory Powers Bill. What’s there not to like?

Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Thursday, July 2nd, 2015

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse (http://zooniverse.org/).

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.
….

Kirk goes on to say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model elements/terms of common data models, taxonomies and ontologies? Which would enable users to search across such elements/terms by the properties of those topics. Possibly discovering topics that represent the same subject under a different term or element.
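To make that a little more concrete, here is a minimal sketch in Python. It is not any particular topic map engine; the terms, vocabulary names and identifier values are all invented for illustration. Each topic carries a set of identifying properties, and terms that share an identifier are reported as the same subject.

```python
from collections import defaultdict

# Hypothetical topics: each models one term from some vocabulary.
# Terms, sources, and identifier values are made up for illustration.
topics = [
    {"term": "Heart attack", "source": "vocab-A", "identifiers": {"concept:0001"}},
    {"term": "Myocardial infarction", "source": "vocab-B", "identifiers": {"concept:0001"}},
    {"term": "Stroke", "source": "vocab-A", "identifiers": {"concept:0002"}},
]

def same_subject_groups(topics):
    """Group the terms of topics that share an identifying property value."""
    by_identifier = defaultdict(list)
    for topic in topics:
        for identifier in topic["identifiers"]:
            by_identifier[identifier].append(topic["term"])
    # Keep only identifiers claimed by more than one term.
    return {ident: terms for ident, terms in by_identifier.items() if len(terms) > 1}

groups = same_subject_groups(topics)
print(groups)
```

The point of the sketch is only that the match is made on properties of the topics, not on the bare terms, so a term from an older vocabulary can still be found from a newer one.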

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms? A wealth of information, including previous elements/terms for the same subject being present at each topic.

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

SciGraph

Tuesday, June 9th, 2015

SciGraph

From the webpage:

SciGraph aims to represent ontologies and data described using ontologies as a Neo4j graph. SciGraph reads ontologies with owlapi and ingests ontology formats available to owlapi (OWL, RDF, OBO, TTL, etc). Have a look at how SciGraph translates some simple ontologies.

Goals:

  • OWL 2 Support
  • Provide a simple, usable, Neo4j representation
  • Efficient, parallel ontology ingestion
  • Provide basic “vocabulary” support
  • Stay domain agnostic

Non-goals:

  • Create ontologies based on the graph
  • Reasoning support

Some applications of SciGraph:

  • the Monarch Initiative uses SciGraph for both ontologies and biological data modeling [repaired link] [Monarch enables navigation across a rich landscape of phenotypes, diseases, models, and genes for translational research.]
  • SciCrunch uses SciGraph for vocabulary and annotation services [biomedical but also has US patents?]
  • CINERGI uses SciGraph for vocabulary and annotation services [Community Inventory of EarthCube Resources for Geosciences Interoperability, looks very ripe for a topic map discussion]

If you are interested in representation, modeling or data integration with ontologies, you definitely need to take a look at SciGraph.
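If the idea of an ontology as a graph is new to you, the following is a rough sketch in plain Python. It is not SciGraph's translation and does not use owlapi or Neo4j; the class names are invented. Classes become nodes, subclass axioms become edges, and a query is just a walk along those edges.

```python
from collections import defaultdict

# Hypothetical subclass axioms as (child, parent) pairs.
axioms = [("Dog", "Mammal"), ("Cat", "Mammal"), ("Mammal", "Animal")]

# Adjacency list: each class node points at its direct superclasses.
superclasses = defaultdict(set)
for child, parent in axioms:
    superclasses[child].add(parent)

def ancestors(cls):
    """Follow subclass edges upward and collect every class reached."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in superclasses.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("Dog")))
```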

Enjoy!

GOLD (General Ontology for Linguistic Description) Standard

Thursday, February 19th, 2015

GOLD (General Ontology for Linguistic Description) Standard

From the homepage:

The purpose of the GOLD Community is to bring together scholars interested in best-practice encoding of linguistic data. We promote best practice as suggested by E-MELD, encourage data interoperability through the use of the GOLD Standard, facilitate search across disparate data sets and provide a platform for sharing existing data and tools from related research projects. The development and refinement of the GOLD Standard will be the basis for and the product of the combined efforts of the GOLD Community. This standard encompasses linguistic concepts, definitions of these concepts and relationships between them in a freely available ontology.

The GOLD standard is dated 2010 and I didn’t see any updates for it.

If you are interested in capturing the subject identity properties before new nomenclatures replace the ones found here, now would be a good time.

I first saw this in a tweet by the Linguist List.

Storyline Ontology

Thursday, October 16th, 2014

Storyline Ontology

From the post:

The News Storyline Ontology is a generic model for describing and organising the stories news organisations tell. The ontology is intended to be flexible to support any given news or media publisher’s approach to handling news stories. At the heart of the ontology, is the concept of Storyline. As a nuance of the English language the word ‘story’ has multiple meanings. In news organisations, a story can be an individual piece of content, such as an article or news report. It can also be the editorial view on events occurring in the world.

The journalist pulls together information, facts, opinion, quotes, and data to explain the significance of world events and their context to create a narrative. The event is an award being received; the story is the triumph over adversity and personal tragedy of the victor leading up to receiving the reward (and the inevitable fall from grace due to drugs and sexual peccadillos). Or, the event is a bombing outside a building; the story is an escalating civil war or a gas mains fault due to cost cutting. To avoid this confusion, the term Storyline has been used to remove the ambiguity between the piece of creative work (the written article) and the editorial perspective on events.

I know, it’s RDF. Well, but the ontology itself, aside from the RDF cruft, represents a thought out and shared view of story development by major news producers. It is important for that reason if no other.

And you can use it as the basis for developing or integrating other story development ontologies.

Just as the post acknowledges:

As news stories are typically of a subjective nature (one news publisher’s interpretation of any given news story may be different from another’s), Storylines can be attributed to some agent to provide this provenance.

the same is true for ontologies. Ready to claim credit/blame for yours?

FOAM (Functional Ontology Assignments for Metagenomes):…

Wednesday, October 1st, 2014

FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus by Emmanuel Prestat, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku702 )

Abstract:

A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.

Aside from its obvious importance for genomics and bioinformatics, I mention this because the authors point out:

A caveat of this approach is that we did not consider the quality of the tree in the tree-splitting step (i.e. weakly supported branches were equally treated as strongly supported ones), producing models of different qualities. Nevertheless, we decided that the approach of rational classification is better than no classification at all. In the future, the groups could be recomputed, or split more optimally when more data become available (e.g. more KOs). From each cluster related to the KO in process, we extracted the alignment from which HMMs were eventually built.

I take that to mean that this “ontology” represents no unchanging ground truth but rather an attempt to enhance the “…screening of environmental metagenomic and metatranscriptomic sequence datasets for functional genes.”

As more information is gained, the present “ontology” can and will change. Those future changes create the necessity to map those changes and the facts that drove them.

I first saw this in a tweet by Jonathan Eisen.

Ontology-Based Interpretation of Natural Language

Thursday, July 10th, 2014

Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger, John McCrae.

Authors’ description:

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge.

The book Ontology-based interpretation of natural language presents an approach to the interpretation of natural language with respect to specific domain knowledge captured in ontologies. It puts ontologies at the center of the interpretation process, meaning that ontologies not only provide a formalization of domain knowledge necessary for interpretation but also support and guide the construction of meaning representations.

The links under Resources for Ontologies, Lexica and Grammars, as of today return “coming soon.”

Implementations fares a bit better, returning information on various aspects of lemon.

lemon is a proposed meta-model for describing ontology lexica with RDF. It is declarative, thus abstracts from specific syntactic and semantic theories, and clearly separates lexicon and ontology. It follows the principle of semantics by reference, which means that the meaning of lexical entries is specified by pointing to elements in the ontology.

It may just be me but the Lemon model seems more complicated than asking users what identifies their subjects and distinguishes them from other subjects.

Lemon is said to be compatible with RDF, OWL, SPARQL, etc.

But, accurate (to a user) identification of subjects and their relationships to other subjects is more important to me than compatibility with RDF, SPARQL, etc.

You?

I first saw this in a tweet by Stefano Bertolo.

A crowdsourcing approach to building a legal ontology from text

Tuesday, May 27th, 2014

A crowdsourcing approach to building a legal ontology from text by Anatoly P. Getman and Volodymyr V. Karasiuk.

Abstract:

This article focuses on the problems of application of artificial intelligence to represent legal knowledge. The volume of legal knowledge used in practice is unusually large, and therefore the ontological knowledge representation is proposed to be used for semantic analysis, presentation and use of common vocabulary, and knowledge integration of problem domain. At the same time some features of legal knowledge representation in Ukraine have been taken into account. The software package has been developed to work with the ontology. The main features of the program complex, which has a Web-based interface and supports multi-user filling of the knowledge base, have been described. The crowdsourcing method is due to be used for filling the knowledge base of legal information. The success of this method is explained by the self-organization principle of information. However, as a result of such collective work a number of errors are identified, which are distributed throughout the structure of the ontology. The results of application of this program complex are discussed in the end of the article and the ways of improvement of the considered technique are planned.

Curious how you would compare this attempt to extract an ontology from legal texts to the efforts in the 1960’s and 1970’s to extract logic from the United States Internal Revenue Code? Apologies but my undergraduate notes aren’t accessible so I can’t give you article titles and citations.

If you do dig out some of that literature, pointers would be appreciated. As I recall, capturing the “logic” of those passages was fraught with difficulty.

BBC: An Honest Ontology

Thursday, April 24th, 2014

British Broadcasting Corporation launches an Ontology page

From the post:

The British Broadcasting Corporation (BBC) has launched a new page detailing their internal data models. The page provides access to the ontologies the BBC is using to support its audience facing applications such as BBC Sport, BBC Education, BBC Music, News projects and more. These ontologies form the basis of their Linked Data Platform. The listed ontologies include the following;

I think my favorite is:

Core Concepts Ontology - The generic BBC ontology for people, places, events, organisations, themes which represent things that make sense across the BBC. (emphasis added)

I don’t think you can ask for a fairer statement from an ontology than: “which represent things that make sense across the BBC.”

And that’s all any ontology can do. Represent things that make sense in a particular context.

What I wish the BBC ontology did more of (along with other ontologies), is to specify what is required to recognize one of its “things.”

For example, person has these properties: “dateOfBirth, dateOfDeath, gender, occupation, placeOfBirth, placeOfDeath.”

We can ignore “dateOfBirth, dateOfDeath, … placeOfBirth, placeOfDeath” because those would not distinguish a person from a zoo animal, for instance. Ditto for gender.
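As a quick check on that reasoning, here is a small Python sketch. The person properties are the ones listed above; the zoo animal property set is my own assumption and is not part of the BBC ontology.

```python
# Properties listed for the BBC "person" class.
person = {"dateOfBirth", "dateOfDeath", "gender", "occupation",
          "placeOfBirth", "placeOfDeath"}

# Hypothetical property set for a zoo animal; not from the BBC ontology.
zoo_animal = {"dateOfBirth", "dateOfDeath", "gender",
              "placeOfBirth", "placeOfDeath", "species"}

# Properties a person has that the zoo animal does not.
distinguishing = person - zoo_animal
print(distinguishing)
```

Under that assumed zoo animal, the set difference leaves exactly one property.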

So, is “occupation” the sole property by which I can distinguish a person from other entities that can have “dateOfBirth, dateOfDeath, gender, …, placeOfBirth, placeOfDeath” properties?

Noting that “occupation” is described as:

This property associates a person with a thematic area he or she worked in, for example Annie Lennox with Music.

BTW, the only property of “theme” is “occupation” and “thematic area” is undefined.

Works if you share an understanding with the BBC about “occupation” and/or don’t want to talk about the relationship between Annie Lennox and Music.

Of course, without more properties, it is hard to know exactly what the BBC means by “thematic area.” That’s ok if you are only using the BBC ontology or if the ambiguity of what is meant is tolerable for your application. Not so ok if you want to map precisely to what the BBC may or may not have meant.

But I do appreciate the BBC being honest about its ontology “…mak[ing] sense across the BBC.”

VOWL: Visual Notation for OWL Ontologies

Friday, April 18th, 2014

VOWL: Visual Notation for OWL Ontologies

Abstract:

The Visual Notation for OWL Ontologies (VOWL) defines a visual language for the user-oriented representation of ontologies. It provides graphical depictions for elements of the Web Ontology Language (OWL) that are combined to a force-directed graph layout visualizing the ontology.

This specification focuses on the visualization of the ontology schema (i.e. the classes, properties and datatypes, sometimes called TBox), while it also includes recommendations on how to depict individuals and data values (the ABox). Familiarity with OWL and other Semantic Web technologies is required to understand this specification.

At the end of the specification there is an interesting example but as a “force-directed graph layout” it captures one of the difficulties I have with that approach.

I have this unreasonable notion that a node I select and place in the display should stay where I have placed it, not shift about because I have moved some other node. Quite annoying and I don’t find it helpful at all.
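What I am asking for is not hard to state in code. The sketch below is a spring-only relaxation step in Python, a simplification of force-directed layout and not the algorithm of VOWL or any VOWL tool. The function name, the `pinned` parameter and the sample coordinates are all hypothetical. Nodes listed in `pinned` keep their position no matter what their neighbors do.

```python
import math

def layout_step(positions, edges, pinned, rest_length=1.0, step=0.1):
    """Run one spring-relaxation step; nodes in `pinned` never move."""
    forces = {node: [0.0, 0.0] for node in positions}
    for a, b in edges:
        ax, ay = positions[a]
        bx, by = positions[b]
        dx, dy = bx - ax, by - ay
        dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
        # Spring force: pull together when stretched, push apart when compressed.
        pull = (dist - rest_length) / dist
        forces[a][0] += pull * dx
        forces[a][1] += pull * dy
        forces[b][0] -= pull * dx
        forces[b][1] -= pull * dy
    return {
        node: (x, y) if node in pinned
        else (x + step * forces[node][0], y + step * forces[node][1])
        for node, (x, y) in positions.items()
    }

# Node "A" was placed by the user, so it is pinned.
positions = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 3.0)}
edges = [("A", "B"), ("A", "C")]
after = layout_step(positions, edges, pinned={"A"})
print(after["A"])  # unchanged: (0.0, 0.0)
```

The forces on a pinned node are still computed, so its neighbors settle around it; the pinned node itself is simply never displaced.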

I first saw this at: VOWL: Visual Notation for OWL Ontologies

Prescription vs. Description

Saturday, April 12th, 2014

Kurt Cagle posted this image on Facebook:

engineers-vs-physicists

with this comment:

The difference between INTJs and INTPs in a nutshell. Most engineers, and many programmers, are INTJs. Theoretical scientists (and I’m increasingly putting data scientists in that category) are far more INTPs – they are observers trying to understand why systems of things work, rather than people who use that knowledge to build, control or constrain those systems.

I would rephrase the distinction to be one of prescription (engineers) versus description (scientists) but that too is a false dichotomy.

You have to have some real or imagined description of a system to start prescribing for it and any method for exploring a system has some prescriptive aspects.

The better course is to recognize exploring or building systems has some aspects of both. Making that recognition may (or may not) make it easier to discuss assumptions of either perspective that aren’t often voiced.

Being more from the descriptive side of the house, I enjoy pointing out that behind most prescriptive approaches are software and services to help you implement those prescriptions. Hardly seems like an unbiased starting point to me. 😉

To be fair, however, the descriptive side of the house often has trouble distinguishing between important things to describe and describing everything it can to system capacity, for fear of missing some edge case. The “edge” cases may be larger than the system but if they lack business justification, pragmatics should reign over purity.

Or to put it another way: Prescription alone is too brittle and description alone is too endless.

Effective semantic modeling/integration needs to consist of varying portions of prescription and description depending upon the requirements of the project and projected ROI.

PS: The “ROI” of a project not in your domain, that doesn’t use your data, your staff, etc. is not a measure of the potential “ROI” for your project. Crediting such reports is “ROI” for the marketing department that created the news. Very important to distinguish “your ROI” from “vendor’s ROI.” Not the same thing. If you need help with that distinction, you know where to find me.

Ontology work at the Royal Society of Chemistry

Wednesday, March 19th, 2014

Ontology work at the Royal Society of Chemistry by Antony Williams.

From the description:

We provide an overview of the use we make of ontologies at the Royal Society of Chemistry. Our engagement with the ontology community began in 2006 with preparations for Project Prospect, which used ChEBI and other Open Biomedical Ontologies to mark up journal articles. Subsequently Project Prospect has evolved into DERA (Digitally Enhancing the RSC Archive) and we have developed further ontologies for text markup, covering analytical methods and name reactions. Most recently we have been contributing to CHEMINF, an open-source cheminformatics ontology, as part of our work on disseminating calculated physicochemical properties of molecules via the Open PHACTS. We show how we represent these properties and how it can serve as a template for disseminating different sorts of chemical information.

A bit wordy for my taste but it has numerous references and links to resources. Top stuff!

I had to laugh when I read slide #20:

Why a named reaction ontology?

Despite attempts to introduce systematic nomenclature for organic reactions, lots of chemists still prefer to attach human names.

Those nasty humans! Always wanting “human” names. Grrr! 😉

Afraid so. That is going to continue in a number of disciplines.

When I got to slide #29:

Ontologies as synonym sets for text-mining

it occurred to me that terms in an ontology are like base names in a topic map, in topics with associations with other topics, which also have base names.

The big difference being that ontologies are mono-views that don’t include mapping instructions based on properties in starting ontology or any other ontology to which you could map.

That is, the ontologies I have seen can only report properties of their terms and not which properties must be matched to be the same subject.

Nor do such ontologies report properties of the subjects that are their properties. Much less any mappings from bundles of properties to bundles of properties in other ontologies.
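To make the distinction concrete, here is a minimal sketch of what "writing down which properties must match" could look like. Everything in it is an assumption for illustration: the function `same_subject`, the property names, and the two term records are invented and come from no ontology or topic map standard.

```python
# Hypothetical illustration: these records and property names are invented.

def same_subject(a, b, identity_keys):
    """Return True when both records carry every identity key and agree on each."""
    return all(k in a and k in b and a[k] == b[k] for k in identity_keys)

# Two records for one named reaction, as two imaginary vocabularies might hold it.
term_a = {
    "label": "Diels-Alder reaction",
    "reactants": "diene + dienophile",
    "product": "cyclohexene ring",
}
term_b = {
    "label": "Diels-Alder cycloaddition",
    "reactants": "diene + dienophile",
    "product": "cyclohexene ring",
}

# The mapping rule is explicit: labels may differ, these properties may not.
IDENTITY_KEYS = ("reactants", "product")

print(same_subject(term_a, term_b, IDENTITY_KEYS))  # True
print(same_subject(term_a, term_b, ("label",)))     # False
```

The point of the sketch is only that the rule (`IDENTITY_KEYS`) is recorded alongside the data, so the next person can see why the two terms were treated as the same subject.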

I know the usual argument about combinatorial explosion of mappings, etc., which leaves ontologists with too few arms and legs to point in the various directions.

That argument fails to point out that to have an “uber” ontology, someone has to do the mapping (undisclosed) from variants to the new master ontology. And, they don’t write that mapping down.

So the combinatorial explosion was present, it just didn’t get written down. Can you guess who is empowered as an expert in the new master ontology with undocumented mappings?

The other fallacy in that argument is that topic maps, for example, are always partial world views. I can map as much or as little between ontologies, taxonomies, vocabularies, folksonomies, etc. as I care to do.

If I don’t want to bother mapping “thing” as the root of my topic map, I am free to omit it. All the superstructure clutter goes away and I can focus on immediate ROI concerns.

Unless you want to pay for the superstructure clutter, then by all means, I’m interested! 😉

If you have an ontology, by all means use it as a starting point for your topic map. Or if someone is willing to pay to create yet another ontology, do it. But if they need results before years of travel, debate and bad coffee, give topic maps a try!

PS: The travel, debate and bad coffee never go away for ontologies, even successful ones. Despite the desires of many, the world keeps changing and our views of it along with it. A static ontology is a dead ontology. Same is true for a topic map, save that agreement on its content is required only as far as it is used and no further.

Algebraic and Analytic Programming

Monday, March 10th, 2014

Algebraic and Analytic Programming by Luke Palmer.

In a short post Luke does a great job contrasting algebraic versus analytic approaches to programming.

In an even shorter summary, I would say the difference is “truth” versus “acceptable results.”

Oddly enough, that difference shows up in other areas as well.

The major ontology projects, including linked data, are pushing one and only one “truth.”

Versus other approaches, such as topic maps (at least in my view), that tend towards “acceptable results.”

I am not sure what measure of success you would have other than “acceptable results.”

Or what another measure for a semantic technology would be other than “acceptable results?”

Whether the universal truth of the world folks admit it or not, they just have a different definition of “acceptable results.” Their “acceptable results” means their world view.

I appreciate the work they put into their offer but I have to decline. I already have a world view of my own.

You?

I first saw this in a tweet by Computer Science.

Semantics in Support of Biodiversity Knowledge Discovery:…

Tuesday, March 4th, 2014

Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies by Walls RL, Deck J, Guralnick R, Baskauf S, Beaman R, et al. (2014). (Walls RL, Deck J, Guralnick R, Baskauf S, Beaman R, et al. (2014) Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE 9(3): e89606. doi:10.1371/journal.pone.0089606).

Abstract:

The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.

I want to call to your attention a great description of the current state of biodiversity data:

Assembling the data sets needed for global biodiversity initiatives remains challenging. Biodiversity data are highly heterogeneous, including information about organisms, their morphology and genetics, life history and habitats, and geographical ranges. These data almost always either contain or are linked to spatial, temporal, and environmental data. Biodiversity science seeks to understand the origin, maintenance, and function of this variation and thus requires integrated data on the spatiotemporal dynamics of organisms, populations, and species, together with information on their ecological and environmental context. Biodiversity knowledge is generated across multiple disciplines, each with its own community practices. As a consequence, biodiversity data are stored in a fragmented network of resource silos, in formats that impede integration. The means to properly describe and interrelate these different data sources and types is essential if such resources are to fulfill their potential for flexible use and re-use in a wide variety of monitoring, scientific, and policy-oriented applications [5]. (From the introduction)

Contrast that with the final claim in the abstract:

We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers. (emphasis added)

I am very confident that both of those statements, from the introduction and from the abstract, are as true as human speakers can achieve.

However, the displacement of an unknown number of communities of practice, which vary even within disciplines, to say nothing of between disciplines, by these ontologies, seems highly unlikely. Not to mention planning for the fate of data from soon to be previous community practices.

Or perhaps I should observe that such a displacement has never happened. True, over time a community of practice may die, only to be replaced by another one but I take that as different in kind from an artificial construct that is made by one group and urged upon all others.

Think of it this way, what if the top 100 members of the biodiversity community kept their current community practices but used these ontologies as conversion targets? Followers of those various members could use their community leader’s practice as their conversion target. Reasoning it is easier to follow someone in your own community.

Rather than having arguments about those ontologies that will outlast their usefulness as convenient conversion targets, declare a basis for mapping. Once that is done, conversion to any other target becomes immeasurably easier.

Reducing the semantic friction inherent in conversion to an ontology or data format is an investment in the future.

Battling semantic friction for a conversion to an ontology or data format is an investment you will make over and over again.

Ontology Matching

Tuesday, December 24th, 2013

Ontology Matching: Proceedings of the 8th International Workshop on Ontology Matching, co-located with the 12th International Semantic Web Conference (ISWC 2013) edited by Pavel Shvaiko, et al.

Technical papers:

The Ontology Alignment Evaluation Initiative 2013 results are represented by seventeen (17) papers.

In addition, there are eleven (11) posters.

Complete proceedings in one PDF file.

Ontologies are a popular answer to semantic diversity.

You might say the more semantic diversity in a field, the greater the number of ontologies it has. 😉

A natural consequence of the proliferation of ontologies is the need to match or map between them.

As you know, I prefer to capture the reasons for mappings to avoid repeating the exercise over and over but that’s not a universal requirement.

If you have an hourly contract for mapping between ontologies, you may not want to lessen the burden of such mapping, year in and year out.

And for some purposes, mechanical mappings may be sufficient.

This work is a good update on the current state of the art for matching ontologies.

Cross-categorization of legal concepts…

Tuesday, December 17th, 2013

Cross-categorization of legal concepts across boundaries of legal systems: in consideration of inferential links by Fumiko Kano Glückstad, Tue Herlau, Mikkel N. Schmidt, Morten Mørup.

Abstract:

This work contrasts Giovanni Sartor’s view of inferential semantics of legal concepts (Sartor in Artif Intell Law 17:217–251, 2009) with a probabilistic model of theory formation (Kemp et al. in Cognition 114:165–196, 2010). The work further explores possibilities of implementing Kemp’s probabilistic model of theory formation in the context of mapping legal concepts between two individual legal systems. For implementing the legal concept mapping, we propose a cross-categorization approach that combines three mathematical models: the Bayesian Model of Generalization (BMG; Tenenbaum and Griffiths in Behav Brain Sci 4:629–640, 2001), the probabilistic model of theory formation, i.e., the Infinite Relational Model (IRM) first introduced by Kemp et al. (The twenty-first national conference on artificial intelligence, 2006, Cognition 114:165–196, 2010) and its extended model, i.e., the normal-IRM (n-IRM) proposed by Herlau et al. (IEEE International Workshop on Machine Learning for Signal Processing, 2012). We apply our cross-categorization approach to datasets where legal concepts related to educational systems are respectively defined by the Japanese- and the Danish authorities according to the International Standard Classification of Education. The main contribution of this work is the proposal of a conceptual framework of the cross-categorization approach that, inspired by Sartor (Artif Intell Law 17:217–251, 2009), attempts to explain reasoner’s inferential mechanisms.

From the introduction:

An ontology is traditionally considered as a means for standardizing knowledge represented by different parties involved in communications (Gruber 1992; Masolo et al. 2003; Declerck et al. 2010). Kemp et al. (2010) also points out that some scholars (Block 1986; Field 1977; Quilian 1968) have argued the importance of knowledge structuring, i.e., ontologies, where concepts are organized into systems of relations and the meaning of a concept partly depends on its relationships to other concepts. However, real human to human communication cannot be absolutely characterized by such standardized representations of knowledge. In Kemp et al. (2010), two challenging issues are raised against such idea of systems of concepts. First, as Fodor and Lepore (1992) originally pointed out, it is beyond comprehension that the meaning of any concept can be defined within a standardized single conceptual system. It is unrealistic that two individuals with different beliefs have common concepts. This issue has also been discussed in semiotics (Peirce 2010; Durst-Andersen 2011) and in cognitive pragmatics (Sperber and Wilson 1986). For example, Sperber and Wilson (1986) discuss how mental representations are constructed diversely under different environmental and cognitive conditions. A second point which Kemp et al. (2010) specifically address in their framework is the concept acquisition problem. According to Kemp et al. (2010; see also: Hempel (1985), Woodfield (1987)):

if the meaning of each concept depends on its role within a system of concepts, it is difficult to see how a learner might break into the system and acquire the concepts that it contains. (Kemp et al. 2010)

Interestingly, the similar issue is also discussed by legal information scientists. Sartor (2009) argues that:

legal concepts are typically encountered in the context of legal norms, and the issue of determining their content cannot be separated from the issue of identifying and interpreting the norms in which they occur, and of using such norms in legal inference. (Sartor 2009)

This argument implies that if two individuals who are respectively belonging to two different societies having different legal systems, they might interpret a legal term differently, since the norms in which the two individuals belong are not identical. The argument also implies that these two individuals must have difficulties in learning a concept contained in the other party’s legal system without interpreting the norms in which the concept occurs.

These arguments motivate us to contrast the theoretical work presented by Sartor (2009) with the probabilistic model of theory formation by Kemp et al. (2010) in the context of mapping legal concepts between two individual legal systems. Although Sartor’s view addresses the inferential mechanisms within a single legal system, we argue that his view is applicable in a situation where a concept learner (reasoner) is, based on the norms belonging to his or her own legal system, going to interpret and adapt a new concept introduced from another legal system. In Sartor (2009), the meaning of a legal term results from the set of inferential links. The inferential links are defined based on the theory of Ross (1957) as:

  1. the links stating what conditions determine the qualification Q (Q-conditioning links), and
  2. the links connecting further properties to possession of the qualification Q (Q-conditioned links.) (Sartor 2009)

These definitions can be seen as causes and effects in Kemp et al. (2010). If a reasoner is learning a new legal concept in his or her own legal system, the reasoner is supposed to seek causes and effects identified in the new concept that are common to the concepts which the reasoner already knows. This way, common-causes and common-effects existing within a concept system, i.e., underlying relationships among domain concepts, are identified by a reasoner. The probabilistic model in Kemp et al. (2010) is supposed to learn these underlying relationships among domain concepts and identify a system of legal concepts from a view where a reasoner acquires new concepts in contrast to the concepts already known by the reasoner.

Pardon the long quote but the paper is pay-per-view.

I haven’t started to run down all the references but this is an interesting piece of work.

I was most impressed by the partial echoing of the topic map paradigm that: “meaning of each concept depends on its role within a system of concepts….”

True, a topic map can capture only “surface” facts and relationships between those facts but that merits a comment on a topic map instance and not topic maps in general.

Noting that you also shouldn’t pay for more topic map than you need. If all you need is a flat mapping between DHS and say the CIA, then doing no more than mapping terms is sufficient. If you need a maintainable and robust mapping, different techniques would be called for. Both results would be topic maps, but one of them would be far more useful.

One of the principal sources relied upon by the authors is: The Nature of Legal Concepts: Inferential Nodes or Ontological Categories? by Giovanni Sartor.

I don’t see any difficulty with Sartor’s rules of inference, any more than saying if a topic has X property (occurrence in TMDM speak), then of necessity it must have property E, F, and G.

Where I would urge caution is with the notion that properties of a legal concept spring from a legal text alone. Or even from a legal ontology. In part because two people in the same legal system can read the same legal text and/or use the same legal ontology and expect to see different properties for a legal concept.

Consider the text of Paradise Lost by John Milton. If Stanley Fish, a noted Milton scholar, were to assign properties to the concepts in Book 1, his list of properties would be quite different from my list of properties. Same words, same text, but very different property lists.

To refine what I said about the topic map paradigm a bit earlier, it should read: “meaning of each concept depends on its role within a system of concepts [and the view of its hearer/reader]….”

The hearer/reader being the paramount consideration. Without a hearer/reader, there is no concept or system of concepts or properties of either one for comparison.

When topics are merged, there is a collecting of properties, some of which you may recognize and some of which I may recognize, as identifying some concept or subject.
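As a rough sketch of that collecting of properties (not the TMDM merging rules, just a plain union of property values), assume each topic is a dictionary from property name to a set of values. The function name `merge_topics` and the sample topics are made up for illustration.

```python
def merge_topics(*topics):
    """Union the property values of topics already judged to be the same subject."""
    merged = {}
    for topic in topics:
        for prop, values in topic.items():
            merged.setdefault(prop, set()).update(values)
    return merged

# Invented example: your term and my term for one subject.
yours = {"name": {"automobile"}, "note": {"from your vocabulary"}}
mine = {"name": {"car"}, "note": {"from my vocabulary"}}

merged = merge_topics(yours, mine)
print(sorted(merged["name"]))  # ['automobile', 'car']
```

After the merge, a reader who knows only "car" and a reader who knows only "automobile" both find a property they recognize on the same topic.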

No guarantees but better than repeating your term for a concept over and over again, each time in a louder voice. 😉

Ten Quick Tips for Using the Gene Ontology

Tuesday, November 26th, 2013

Ten Quick Tips for Using the Gene Ontology by Judith A. Blake.

From the post:

The Gene Ontology (GO) provides core biological knowledge representation for modern biologists, whether computationally or experimentally based. GO resources include biomedical ontologies that cover molecular domains of all life forms as well as extensive compilations of gene product annotations to these ontologies that provide largely species-neutral, comprehensive statements about what gene products do. Although extensively used in data analysis workflows, and widely incorporated into numerous data analysis platforms and applications, the general user of GO resources often misses fundamental distinctions about GO structures, GO annotations, and what can and can not be extrapolated from GO resources. Here are ten quick tips for using the Gene Ontology.

Tip 1: Know the Source of the GO Annotations You Use

Tip 2: Understand the Scope of GO Annotations

Tip 3: Consider Differences in Evidence Codes

Tip 4: Probe Completeness of GO Annotations

Tip 5: Understand the Complexity of the GO Structure

Tip 6: Choose Analysis Tools Carefully

Tip 7: Provide the Version of the Data/Tools Used

Tip 8: Seek Input from the GOC Community and Make Use of GOC Resources

Tip 9: Contribute to the GO

Tip 10: Acknowledge the Work of the GO Consortium

See Judith’s article for her comments and pointers under each tip.

The takeaway here is that an ontology may have the information you are looking for, but understanding what you have found is an entirely different matter.

For GO, follow Judith’s specific suggestions/tips; for any other ontology, take steps to understand the ontology before relying upon it.

I first saw this in a tweet by Stephen Turner.

BARTOC launched : A register for vocabularies

Friday, November 15th, 2013

BARTOC launched : A register for vocabularies by Sarah Dister

From the post:

Looking for a classification system, controlled vocabulary, ontology, taxonomy, thesaurus that covers the field you are working in? The University Library of Basel in Switzerland recently launched a register containing the metadata of 600 controlled and structured vocabularies in 65 languages. Its official name: the Basel Register of Thesauri, Ontologies and Classifications (BARTOC).

High quality search

All items in BARTOC are indexed with Eurovoc, EU’s multilingual thesaurus, and classified using Dewey Decimal Classification (DDC) numbers down to the third level, allowing a high quality subject search. Other search characteristics are:

  • The search interface is available in 20 languages.
  • A Boolean operators field is integrated into the search box.
  • The advanced search allows you to refine your search by Field type, Language, DDC, Format and Access.
  • In the results page you can refine your search further by using the facets on the right side.

A great step towards bridging vocabularies but at a much higher (more general) level than any enterprise or government department.

Measuring the Evolution of Ontology Complexity:…

Tuesday, October 15th, 2013

Measuring the Evolution of Ontology Complexity: The Gene Ontology Case Study by Olivier Dameron, Charles Bettembourg, Nolwenn Le Meur. (Dameron O, Bettembourg C, Le Meur N (2013) Measuring the Evolution of Ontology Complexity: The Gene Ontology Case Study. PLoS ONE 8(10): e75993. doi:10.1371/journal.pone.0075993)

Abstract:

Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the nodes connectivity and the hierarchical structure.

The number of classes and relations increased monotonously for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonously for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred.

The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but decreasing slightly its complexity, and MF complexity remained stable.

Prospective and current ontology authors need to read this paper carefully.

Over a period of only five years (sixty monthly releases, January 2008 through December 2012), the ontologies studied in this paper evolved.

Which is a good thing, because the understandings that underpinned the original ontologies changed over those five years.

The lesson here being that for all of its apparent fixity, a useful ontology is no more fixed than the authors who create and maintain it and the users who use it.

At any point in time an ontology may be “fixed” for some purpose or in some view, but that is a snapshot in time, not an eternal view.

As ontologies evolve, so must the mappings that bind them with and to other ontologies.

A blind mapping, a simple juxtaposition of terms from ontologies, is one form of mapping. A form that makes maintenance a difficult and chancy affair.

If on the other hand, each term had properties that supported the recorded mapping, any maintainer could follow enunciated rules for maintenance of that mapping.

Blind mapping: Pay the cost of mapping every time the ontology mappings fall far enough out of synchronization to pinch (or lead to disaster).

Sustainable mapping: Pay the full cost of mapping once and then maintain the mapping.
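A minimal sketch of what "maintain the mapping" could mean in practice, assuming each mapping records the properties that justified it. The `Mapping` class, the `stale_mappings` function, the term identifiers, and the property values are all hypothetical; nothing here is taken from the Gene Ontology or any mapping tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    source_term: str
    target_term: str
    basis: tuple  # property names that must agree for the mapping to hold

def stale_mappings(mappings, source_props, target_props):
    """Return mappings whose recorded basis no longer holds in the given releases."""
    stale = []
    for m in mappings:
        src = source_props.get(m.source_term, {})
        tgt = target_props.get(m.target_term, {})
        # A basis property that is missing or differs breaks the mapping.
        if any(src.get(p) is None or src.get(p) != tgt.get(p) for p in m.basis):
            stale.append(m)
    return stale

# Invented data: one mapping, checked against two releases of the target ontology.
mappings = [Mapping("A:cell_wall", "B:wall", ("parent", "definition"))]
source = {"A:cell_wall": {"parent": "cell part", "definition": "rigid outer layer"}}
target_v1 = {"B:wall": {"parent": "cell part", "definition": "rigid outer layer"}}
target_v2 = {"B:wall": {"parent": "envelope", "definition": "rigid outer layer"}}

print(stale_mappings(mappings, source, target_v1))       # []
print(len(stale_mappings(mappings, source, target_v2)))  # 1
```

With a blind mapping there is no `basis` to check, so a change like the one in `target_v2` goes unnoticed until it pinches.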

What’s your comfort level with risk?

  • Discovery of a “smoking gun” memo on tests of consumer products.
  • Inappropriate access to spending or financial records.
  • Preservation of inappropriate emails.
  • etc.

What are you not able to find with an unmaintained ontology?

Web Scale? Or do you want to try for human scale?

Sunday, August 4th, 2013

How often have you heard the claim this or that technology is “web scale?”

How big is “web scale?”

Visit http://www.worldwidewebsize.com/ to get an estimate of the size of the Web.

As of today, the estimated number of indexed web pages for Google is approximately 47 billion pages.

How does that compare, say to scholarly literature?

Would you believe 1 trillion pages of scholarly journal literature?

An incomplete inventory (Fig. 1), divided into biological, social, and physical sciences, contains 400, 200, and 65 billion pages, respectively (see supplemental data*).

Or better with an image:

[Image: webscale]

I didn’t bother putting in the trillion page data but for your information, the indexed Web is < 5% of all scholarly journal literature.
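For anyone who wants to check the arithmetic, here is a small sketch using only the figures quoted above (47 billion indexed pages, 1 trillion pages of scholarly literature, and the 400/200/65 billion partial inventory). The variable names are mine.

```python
# Figures as quoted in the post; integer page counts.
indexed_web_pages = 47_000_000_000
scholarly_pages = 1_000_000_000_000
inventory = {
    "biological": 400_000_000_000,
    "social": 200_000_000_000,
    "physical": 65_000_000_000,
}

share = indexed_web_pages / scholarly_pages
inventory_total = sum(inventory.values())

print(f"Indexed Web as a share of scholarly literature: {share:.1%}")  # 4.7%
print(f"Incomplete inventory: {inventory_total // 10**9} billion pages")  # 665 billion
```

So the incomplete inventory alone accounts for 665 billion of the trillion pages, and the indexed Web comes to roughly 4.7% of the total.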

Nor did I try to calculate the data that Chicago is collecting every day with 10,000 video cameras.

Is your app ready to step up to human scale information retrieval?

*Advancing science through mining libraries, ontologies, and communities by JA Evans, A. Rzhetsky. J Biol Chem. 2011 Jul 8;286(27):23659-66. doi: 10.1074/jbc.R110.176370. Epub 2011 May 12.

To index is to translate

Tuesday, July 30th, 2013

To index is to translate by Fran Alexander.

From the post:

Living in Montreal means I am trying to improve my very limited French and in trying to communicate with my Francophone neighbours I have become aware of a process of attempting to simplify my thoughts and express them using the limited vocabulary and grammar that I have available. I only have a few nouns, fewer verbs, and a couple of conjunctions that I can use so far and so trying to talk to people is not so much a process of thinking in English and translating that into French, as considering the basic core concepts that I need to convey and finding the simplest ways of expressing relationships. So I will say something like “The sun shone. It was big. People were happy” because I can’t properly translate “We all loved the great weather today”.

This made me realise how similar this is to the process of breaking down content into key concepts for indexing. My limited vocabulary is much like the controlled vocabulary of an indexing system, forcing me to analyse and decompose my ideas into simple components and basic relationships. This means I am doing quite well at fact-based communication, but my storytelling has suffered as I have only one very simple emotional register to work with. The best I can offer is a rather laconic style with some simple metaphors: “It was like a horror movie.”

It is regularly noted that ontology work in the sciences has forged ahead of that in the humanities, and the parallel with my ability to express facts but not tell stories struck me. When I tell my simplified stories I rely on shared understanding of a broad cultural context that provides the emotional aspect – I can use the simple expression “horror movie” because the concept has rich emotional associations, connotations, and resonances for people. The concept itself is rather vague, broad, and open to interpretation, so the shared understanding is rather thin. The opposite is true of scientific concepts, which are honed into precision and a very constrained definitive shared understanding. So, I wonder how much of sense that I can express facts well is actually an illusion, and it is just that those factual concepts have few emotional resonances.

Is mapping a process of translation?

Are translations always less rich than the source?

Or are translations as rich but differently rich?

Integrating controlled vocabularies… (webinar)

Friday, June 28th, 2013

“Integrating controlled vocabularies in information management systems : the new ontology plug-in”, 4th July

From the post:

The Webinar will introduce the new ontology plug-in developed in the context of the AIMS Community, how it works and the usage possibilities. It was created within the context of AgriOcean DSpace, however it is an independent plug-in and can be used in any other applications and information management systems.

The ontology plug-in searches multiple thesauri and ontologies simultaneously by using a web service broker (e.g. AGROVOC, ASFA, Plant Ontology, NERC-C19 ontology, and OceanExpert). It delivers as output a list of selected concepts, where each concept has a URI (or unique ID), a preferred label with optional language definition and the ontology from which the concepts has been selected. The application uses JAVA, Javascript and JQuery. As it is open software, developers are invited to reuse, enrich and enhance the existing source code.

We invite the participants of the webinar to give their view how thesauri and ontologies can be used in repositories and other types of information management systems and how the ontology plug-in can be further developed.

Date

4th of July 2013 – 16:00 Rome Time (Use Time Converter to calculate the time difference between your location and Rome , Italy)

On my must watch list.

Demo: http://193.190.8.15/ontwebapp/ontology.html

Source: https://code.google.com/p/ontology-plugin/

Imagine adapting the plugin to search for URIs in <a> elements and searching a database for the subjects they identify.
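A rough sketch of that idea, using only the Python standard library rather than the plug-in itself. The `LinkCollector` class, the `subjects_in` function, the `SUBJECTS` dictionary standing in for a database, and the example.org URIs are all assumptions for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.uris.append(value)

# Stand-in for a database keyed by subject identifier (made-up URIs).
SUBJECTS = {
    "http://example.org/concept/rice": "Rice (hypothetical concept record)",
}

def subjects_in(html):
    """Return the known subjects identified by links in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return {uri: SUBJECTS[uri] for uri in collector.uris if uri in SUBJECTS}

page = (
    '<p>See <a href="http://example.org/concept/rice">rice</a> and '
    '<a href="http://example.org/other">something else</a>.</p>'
)
print(subjects_in(page))
```

Links whose URIs are not in the lookup are simply ignored here; a real adaptation would have to decide what to do with them.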

Vocabulary Management at W3C (Draft) [ontology and vocabulary as synonyms]

Thursday, June 6th, 2013

Vocabulary Management at W3C (Draft)

From the webpage:

One of the major stumbling blocks in deploying RDF has been the difficulty data providers have in determining which vocabularies to use. For example, a publisher of scientific papers who wants to embed document metadata in the web pages about each paper has to make an extensive search to find the possible vocabularies and gather the data to decide which among them are appropriate for this use. Many vocabularies may already exist, but they are difficult to find; there may be more than one on the same subject area, but it is not clear which ones have a reasonable level of stability and community acceptance; or there may be none, i.e. one may have to be developed in which case it is unclear how to make the community know about the existence of such a vocabulary.

There have been several attempts to create vocabulary catalogs, indexes, etc. but none of them has gained a general acceptance and few have remained up for very long. The latest notable attempt is LOV, created and maintained by Bernard Vatant (Mondeca) and Pierre-Yves Vandenbussche (DERI) as part of the DataLift project. Other application areas have more specific, application-dependent catalogs; e.g., the HCLS community has established such application-specific “ontology portals” (vocabulary hosting and/or directory services) as NCBO and OBO. (Note that for the purposes of this document, the terms “ontology” and “vocabulary” are synonyms.) Unfortunately, many of the cataloging projects in the past relied on a specific project or some individuals and they became, more often than not, obsolete after a while.

Initially (1999-2003) W3C stayed out of this process, waiting to see if the community would sort out this issue by itself. We hoped to see the emergence of an open market for vocabularies, including development tools, reviews, catalogs, consultants, etc. When that did not emerge, we decided to begin offering ontology hosting (on www.w3.org) and we began the Ontaria project (with DARPA funding) to provide an ontology directory service. Implementation of these services was not completed, however, and project funding ended in 2005. After that, W3C took no active role until the emergence of schema.org and the eventual creation of the Web Schemas Task Force of the Semantic Web Interest Group. WSTF was created both to provide an open process for schema.org and as a general forum for people interested in developing vocabularies. At this point, we are contemplating taking a more active role supporting the vocabulary ecosystem. (emphasis added)

The W3C proposal fails to address two issues with vocabularies:

1. Vocabularies are not the origin of the meanings of terms they contain.

Awful, according to yet another master of the king’s English quoted by Fries, could only mean awe-inspiring.

But it was not so. “The real meaning of any word,” argued Fries, “must be finally determined, not by its original meaning, its source or etymology, but by the content given the word in actual practical usage…. Even a hardy purist would scarcely dare pronounce a painter’s masterpiece awful, without explanations.” [The Story of Ain’t by David Skinner, HarperCollins 2012, page 47]

Vocabularies represent some community of semantic practice but that brings us to the second problem the W3C proposal ignores.

2. The meanings of terms in a vocabulary are not stable, universal or self-evident.

The problem with most vocabularies is that they have no way to signal the context, community or other information that would help distinguish one vocabulary meaning from another.

A human reader may intuit context and other clues from a vocabulary and use those factors when comparing the vocabulary to a text.

Computers, on the other hand, know no more than they have been told.

Vocabularies need to move beyond being simple tokens and represent terms with structures that capture some of the information a human reader knows intuitively about those terms.

Otherwise vocabularies will remain mute records of some socially defined meaning, but we won’t know which ones.
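To make the contrast with bare tokens concrete, here is a minimal Python sketch of a term carried as a small structure that records the community of usage along with the meaning. The TermSense class, its field names, the senses_for helper and the two sample entries are assumptions for illustration only; they loosely follow the “awful” example quoted above and are not drawn from any W3C proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSense:
    """One meaning of a token, tied to the community that uses it that way."""

    token: str
    community: str
    definition: str


# Illustrative data only: two communities, one token, two meanings.
SENSES = [
    TermSense("awful", "etymological purists", "awe-inspiring"),
    TermSense("awful", "everyday usage", "very bad"),
]


def senses_for(token, senses, community=None):
    """Return the senses of a token, optionally narrowed to one community."""
    matches = [s for s in senses if s.token == token]
    if community is not None:
        matches = [s for s in matches if s.community == community]
    return matches


for sense in senses_for("awful", SENSES):
    print(f"{sense.token} ({sense.community}): {sense.definition}")
```

A bare token lookup returns both senses with nothing to choose between them; once the community is part of the structure, a program can at least be told which meaning applies.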

EDAM: an ontology of bioinformatics operations,…

Wednesday, May 15th, 2013

EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats by Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer and Peter Rice. (Bioinformatics (2013) 29 (10): 1325-1332. doi: 10.1093/bioinformatics/btt113)

Abstract:

Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations.

Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl.

No matter how many times I read it, I just don’t get:

Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

I will be generous and assume the authors meant “machine-processable descriptions” when I read “machine-understandable descriptions.” It is well known that machines don’t “understand” data; they simply process it according to specified instructions.

But more to the point, machines are indifferent to the type or number of descriptions they have for any subject. It might confuse a human processor to have thirty (30) different descriptions for the same subject but there has been no showing of such a limit for machines.

Every effort to produce a “comprehensive” ontology/classification/taxonomy, pick your brand of poison, has been in the face of competing and different descriptions. That is, after all, the rationale for a comprehensive …, that there are too many choices already.

The outcome of all such efforts, assuming there are N diverse descriptions, is N + 1 diverse descriptions, the 1 being the current project added to the existing diverse descriptions.

Enigma

Friday, May 10th, 2013

Enigma

I suppose it had to happen. With all the noise about public data sets, someone was bound to create a startup to search them. 😉

Not a lot of detail at the site but you can sign up for a free trial.

Features:

100,000+ Public Data Sources: Access everything from import bills of lading, to aircraft ownership, lobbying activity, real estate assessments, spectrum licenses, financial filings, liens, government spending contracts and much, much more.

Augment Your Data: Get a more complete picture of investments, customers, partners, and suppliers. Discover unseen correlations between events, geographies and transactions.

API Access: Get direct access to the data sets, relational engine and NLP technologies that power Enigma.

Request Custom Data: Can’t find a data set anywhere else? Need to synthesize data from disparate sources? We are here to help.

Discover While You Work: Never miss a critical piece of information. Enigma uncovers entities in context, adding intelligence and insight to your daily workflow.

Powerful Context Filters: Our vast collection of public data sits atop a proprietary data ontology. Filter results by topics, tags and source to quickly refine and scope your query.

Focus on the Data: Immerse yourself in the details. Data is presented in its raw form, full screen and without distraction.

Curated Metadata: Source data is often unorganized and poorly documented. Our domain experts focus on sanitizing, organizing and annotating the data.

Easy Filtering: Rapidly prototype hypotheses by refining and shaping data sets in context. Filter tools allow the sorting, refining, and mathematical manipulation of data sets.

The “proprietary data ontology” jumps out at me as an obvious question. Do users get to know what the ontology is?

Not to mention the “our domain experts focus on sanitizing,….” That works for some cases; take legal research, for example. I am not sure that “your” experts work as well as “my” experts for less focused areas.

Looking forward to learning more about Enigma!

Does statistics have an ontology? Does it need one? (draft 2)

Tuesday, April 16th, 2013

Does statistics have an ontology? Does it need one? (draft 2) by D. Mayo.

From the post:

Chance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).* Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g., http://errorstatistics.com/2012/10/18/query/).

The post and ensuing comments offer much to consider.

From my perspective, if assumptions, ontological and otherwise, go unstated, the results are opaque.

You can accept them, because they fit your prior opinion or how you wanted the results to be, or reject them as not fitting your prior opinion or desired result.