Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse (

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.

Kirk goes onto say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model elements/terms of common data models, taxonomies and ontologies? Which would enable user to search across such elements/terms by the properties of those topics. Possibly discovering topics that represent the same subject under a different term or element.

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms? A wealth of information, including previous elements/terms for the same subject being present at each topic.

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

Comments are closed.