Archive for the ‘Taxonomy’ Category

Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Thursday, July 2nd, 2015

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse.

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.
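The machine-plus-human loop described above is easy to sketch. Here is a toy Python illustration (hypothetical data, labels and thresholds, not any project's actual pipeline): items the classifier is confident about are labeled automatically, anomalous items are routed to a human queue, and the resulting crowd labels would be folded back into the training set for the next pass.

```python
import math

def centroid(vectors):
    # Mean vector of a list of equal-length tuples.
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(x, centroids, threshold=1.0):
    """Return (label, None) when confident, (None, x) when the item
    should be routed to a human annotator."""
    label, d = min(((lbl, distance(x, c)) for lbl, c in centroids.items()),
                   key=lambda t: t[1])
    return (label, None) if d <= threshold else (None, x)

# Toy training set: 2-D features, two hypothetical classes.
training = {"spiral": [(0.0, 0.0), (0.2, 0.1)],
            "elliptical": [(5.0, 5.0), (5.2, 4.9)]}

def run_pass(items, training, threshold=1.0):
    centroids = {lbl: centroid(vs) for lbl, vs in training.items()}
    auto, queue = [], []
    for x in items:
        label, _ = classify(x, centroids, threshold)
        auto.append((x, label)) if label else queue.append(x)
    return auto, queue

auto, queue = run_pass([(0.1, 0.0), (2.5, 2.5)], training)
# (0.1, 0.0) is near the "spiral" centroid and is labeled automatically;
# (2.5, 2.5) matches nothing well and goes to the human queue. A crowd
# label for it would then be appended to `training` for the next pass.
```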

Kirk goes on to say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that this has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model the elements/terms of common data models, taxonomies and ontologies? That would enable users to search across such elements/terms by the properties of those topics, possibly discovering topics that represent the same subject under a different term or element.

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms. A wealth of information, including previous elements/terms for the same subject, would be present at each topic.

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

Flax Clade PoC

Monday, July 14th, 2014

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer

Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, “Social Psychology”). There is no distinction between parent and child nodes (except that the former has children) and the hierarchical structure of the taxonomy is completely orthogonal to the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.
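The scheme just described, keywords joined with Boolean OR and categories ranked by TF-IDF score, can be mimicked in a few lines of Python. This is a toy sketch of the idea with made-up documents and categories, not Clade's actual Solr-based implementation:

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Per-document term weights: term frequency times inverse document
    frequency over the whole corpus."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    return [{t: c * math.log(n / df[t])
             for t, c in Counter(d.split()).items()} for d in docs]

def categorise(docs, categories):
    """Assign each doc to every category with at least one keyword hit
    (Boolean OR), ranked by summed TF-IDF weight."""
    weights = tf_idf_scores(docs)
    result = []
    for w in weights:
        scores = {cat: sum(w.get(k, 0.0) for k in kws)
                  for cat, kws in categories.items()}
        hits = sorted((c for c, s in scores.items() if s > 0),
                      key=lambda c: -scores[c])
        result.append(hits)
    return result

docs = ["conformity obedience authority experiment",
        "group identity conformity",
        "memory recall experiment"]
categories = {"social influence": {"conformity", "obedience"},
              "cognition": {"memory", "recall"}}
```

A document matching any one keyword of a category belongs to it; documents matching more (or rarer) keywords rank higher, which is roughly what Solr's scoring does for an OR query.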

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in Solr, but see below for how to export the document categorisation to an XML or CSV file.

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.

Finding needles in haystacks:…

Sunday, July 6th, 2014

Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi by Conrad L. Schoch, et al. (Database (2014) 2014 : bau061 doi: 10.1093/database/bau061).


DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi.

Database URL:

If you are interested in projects to update and correct existing databases, this is the article for you.

Fungi may not be on your regular reading list but consider one aspect of the problem described:

It is projected that there are ~400 000 fungal names already in existence. Although only 100 000 are accepted taxonomically, it still makes updates to the existing taxonomic structure a continuous task. It is also clear that these named fungi represent only a fraction of the estimated total, 1–6 million fungal species (93–95).

I would say that computer science isn’t the only discipline where “naming things” is hard.


PS: The other lesson from this paper (and many others) is that semantic accuracy is neither easy nor cheap. Anyone who says differently is lying.

Names are not (always) useful

Monday, April 21st, 2014

PhyloCode names are not useful for phylogenetic synthesis

From the post:

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions. (emphasis added)

If you substitute “names” for “taxa” then it is easy to see my point in Plato, Shiva and A Social Graph about nodes that are “abstract concept devoid of interpretation.” There is nothing to compare.

This isn’t a new problem but a very old one that keeps being repeated.

For processing reasons it may be useful to act as though taxa (or names) are simply given. A digital or print index need not struggle to find a grounding for the terms it reports. For some purposes, that is completely unnecessary.

On the other hand, we should not forget the lack of grounding is purely a convenience for processing or other reasons. We can choose differently should an occasion merit it.

ZooKeys 50 (2010) Special Issue

Wednesday, January 29th, 2014

Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research by Lyubomir Penev, et al.

From the editorial:

The principles of Open Access greatly facilitate dissemination of information through the Web where it is freely accessed, shared and updated in a form that is accessible to indexing and data mining engines using Web 2.0 technologies. Web 2.0 turns the taxonomic information into a global resource well beyond the taxonomic community. A significant bottleneck in naming species is the requirement by the current Codes of biological nomenclature ruling that new names and their associated descriptions must be published on paper, which can be slow, costly and render the new information difficult to find. In order to make progress in documenting the diversity of life, we must remove the publishing impediment in order to move taxonomy “from a cottage industry into a production line” (Lane et al. 2008), and to make best use of new technologies warranting the fastest and widest distribution of these new results.

In this special edition of ZooKeys we present a practical demonstration of such a process. The issue opens with a forum paper from Penev et al. (doi: 10.3897/zookeys.50.538) that presents the landscape of semantic tagging and text enhancements in taxonomy. It describes how the content of the manuscript is enriched by semantic tagging and marking up of four exemplar papers submitted to the publisher in three different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (Stoev et al., doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads (Blagoderov et al., doi: 10.3897/zookeys.50.506 and Brake and Tschirnhaus, doi: 10.3897/zookeys.50.505); (iii) generated from an author’s database (Taekul et al., doi: 10.3897/zookeys.50.485). The latter two were submitted as XML-tagged manuscript. These examples demonstrate the suitability of the workflow to a range of possibilities that should encompass most current taxonomic efforts. To implement the aforementioned routes for XML mark up in prospective taxonomic publishing, a special software tool (Pensoft Mark Up Tool, PMT) was developed and its features were demonstrated in the current issue. The XML schema used was version #123 of TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine (NLM).

A second forum paper from Blagoderov et al. (doi: 10.3897/zookeys.50.539) sets out a workflow that describes the assembly of elements from a Scratchpad taxon page to export a structured XML file. The publisher receives the submission, automatically renders the file into the journal’s layout style as a PDF and transmits it to a selection of referees, based on the key words in the manuscript and the publisher’s database. Several steps, from the author’s decision to submit the manuscript to final publication and dissemination, are automatic. A journal editor first spends time on the submission when the referees’ reports are received, making the decision to publish, modify or reject the manuscript. If the decision is to publish, then PDF proofs are sent back to the author and, when verified, the paper is published both on paper and on-line, in PDF, HTML and XML formats. The original information is also preserved on the original Scratchpad where it may, in due course, be updated. A visitor arriving at the web site by tracing the original publication will be able to jump forward to the current version of the taxon page.

This sounds like the promise of SGML/XML made real, doesn’t it?

See the rest of the editorial or ZooKeys 50 for a very good example of XML and semantics in action.

This is a long way from the “related” or “recent” article citations in most publisher interfaces. Thoughts on how to make that change?

Introducing mangal,…

Wednesday, January 8th, 2014

Introducing mangal, a database for ecological networks

From the post:

Working with data on ecological networks is usually a huge mess. Most of the time, what you have is a series of matrices with 0 and 1, and in the best cases, another file with some associated metadata. The other issue is that, simply put, data on ecological networks are hard to get. The Interaction Web Database has some, but it's not as actively maintained as it should be, and the data are not standardized in any way. When you need to pull a lot of networks to compare them, it means that you need to go through a long, tedious, and error-prone process of cleaning and preparing the data. It should not be that way, and that is the particular problem I've been trying to solve since this spring.

About a year ago, I discussed why we should have a common language to represent interaction networks. So with this idea in mind, and with great feedback from colleagues, I assembled a series of JSON schemes to represent networks, in a way that will allow programmatic interaction with the data. And I'm now super glad to announce that I am looking for beta-testers, before I release the tool in a formal way. This post is the first part of a series of two or three posts, which will give information about the project, how to interact with the database, and how to contribute data. I'll probably try to write a few use-cases, but if reading these posts inspires you, feel free to suggest some!

So what is that about?

mangal (another word for a mangrove, and a type of barbecue) is a way to represent and interact with networks in a way that is (i) relatively easy and (ii) allows for powerful analyses. It's built around a data format, i.e. a common language to represent ecological networks. You can have an overview of the data format on the website. The data format was conceived with two ideas in mind. First, it must make sense from an ecological point of view. Second, it must be easy to use to exchange data, send them to a database, and get them through APIs. Going on a website to download a text file (or an Excel one) should be a thing of the past, and the data format is built around the idea that everything should be done in a programmatic way.

Very importantly, the data specification explains how data should be formatted when they are exchanged, not when they are used. The R package, notably, uses igraph to manipulate networks. It means that anyone with a database of ecological networks can write an API to expose these data in the mangal format, and in turn, anyone can access the data with the URL of the API as the only information.
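To get a feel for that kind of programmatic interaction, here is a Python sketch that consumes a hypothetical JSON payload loosely in the spirit of an interaction-network format (the field names here are invented; the real mangal specification lives on the project website):

```python
import json

# Hypothetical payload; field names are illustrative, not the real spec.
payload = json.loads("""
{
  "name": "toy food web",
  "taxa": [{"id": 1, "name": "fox"},
           {"id": 2, "name": "rabbit"},
           {"id": 3, "name": "grass"}],
  "interactions": [{"taxon_from": 1, "taxon_to": 2, "type": "predation"},
                   {"taxon_from": 2, "taxon_to": 3, "type": "herbivory"}]
}
""")

# Resolve taxon ids to names and build an edge list, ready to hand to
# a graph library such as igraph.
names = {t["id"]: t["name"] for t in payload["taxa"]}
edges = [(names[i["taxon_from"]], names[i["taxon_to"]], i["type"])
         for i in payload["interactions"]]
# edges == [("fox", "rabbit", "predation"), ("rabbit", "grass", "herbivory")]
```

The point of a shared exchange format is exactly this: one small, predictable parsing step instead of per-dataset cleaning.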

Because everyone uses R, as I've mentioned above, we are also releasing an R package (unimaginatively titled rmangal). You can get it from GitHub, and we'll see in a minute how to install it until it is released on CRAN. Most of these posts will deal with how to use the R package, and what can be done with it. Ideally, you won't need to go on the website at all to interact with the data (but just to make sure you do, the website has some nice eye-candy, with clickable maps and animated networks).

An excellent opportunity to become acquainted with the igraph package for R (299 pages), igraph for Python (394 pages), and the igraph C library (812 pages).

Unfortunately, igraph does not support hypergraphs.


Sunday, January 5th, 2014

taxize vignette – a taxonomic toolbelt for R

From the webpage:

taxize is a taxonomic toolbelt for R. taxize wraps APIs for a large suite of taxonomic databases available on the web.

Tasks you can accomplish include:

  • Resolve taxonomic name
  • Retrieve higher taxonomic names
  • Interactive name selection
  • What taxa are the children of my taxon of interest?
  • Matching species tables with different taxonomic resolution

The webpage includes links to apply for API keys (when required).

Currently implemented data sources:

I first saw this in a tweet by Antonio J. Pérez.

WBG Topical Taxonomy

Tuesday, November 26th, 2013

WBG Topical Taxonomy

From the description:

The WBG Taxonomy is a classification schema which represents the concepts used to describe the World Bank Group’s topical knowledge domains and areas of expertise – the ‘what we do’ and ‘what we know’ aspect of the Bank’s work. The WBG Taxonomy provides an enterprise-wide, application-independent framework for describing all of the Bank’s areas of expertise and knowledge domains, current as well as historical, representing the language used by domain experts and domain novices, and Bank staff and Bank clients.

Available in TriG, N-Triples, RDF/XML, Turtle.

A total of 1560 concepts.
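Since the taxonomy ships in line-oriented serializations like N-Triples, the concept count is easy to sanity-check with the standard library alone. A sketch, with a three-line sample standing in for the actual WBG file (the example subjects are invented; the SKOS and RDF URIs are the standard ones):

```python
SKOS_CONCEPT = "<http://www.w3.org/2004/02/skos/core#Concept>"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Inline stand-in for the downloaded N-Triples file.
sample = """\
<http://example.org/wbg/1> {rdf} {cls} .
<http://example.org/wbg/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Budget Transparency"@en .
<http://example.org/wbg/2> {rdf} {cls} .
""".format(rdf=RDF_TYPE, cls=SKOS_CONCEPT)

def count_concepts(ntriples):
    # Distinct subjects declared as skos:Concept.
    concepts = {line.split()[0]
                for line in ntriples.splitlines()
                if RDF_TYPE in line and SKOS_CONCEPT in line}
    return len(concepts)

count_concepts(sample)  # 2 for the sample; 1560 for the full file, per the description
```

For anything beyond counting (labels, broader/narrower links), a proper RDF library is the better tool; this is just a quick check.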

You did hear about the JP Morgan Twitter debacle, JPMorgan humiliates itself in front of all of Twitter?

My favorite tweet (from memory) was: “Does the sleeze come off with a regular shower or does it require something special, like babies’ tears?”

In light of JP Morgan’s experience, why not ask citizens of countries with World Bank debt:

What needs to be added to the “WBG Topical Taxonomy”?

For example:

Budget Transparency – No content other than broader concepts.

Two others at random:

ICT and Social Accountability – No content other than broader concepts. (ICT = Information and Communication Technologies)

Rural Poverty and Livelihoods – No content other than one broader concept.

Do you think vague categories result in avoidance of accountability and corporate responsibility?

So do I.

I first saw this in a tweet by Pool Party.

A Proposed Taxonomy of Plagiarism

Thursday, November 7th, 2013

A Proposed Taxonomy of Plagiarism Or, what we talk about when we talk about plagiarism by Rick Webb.

From the post:

What with the recent Rand Paul plagiarism scandal, I’d like to propose a new taxonomy of plagiarism. Some plagiarism is worse than others, and the basic definition of plagiarism that most people learned in school is only part of it.

Chris Hayes started off his show today by referencing the Wikipedia definition of plagiarism: “the ‘wrongful appropriation’ and ‘purloining and publication’ of another author’s ‘language, thoughts, ideas, or expressions,’ and the representation of them as one’s own original work.” The important point here that most people overlook is the theft of ideas. We all learn in school that plagiarism exists if we wholesale copy and paste other people’s words. But ideas are actually a big part of it.

Interesting read but I am not sure the taxonomy is fine grained enough.

A topic map, like any other publication, has the potential for plagiarism. But I would make plagiarism distinctions for topic map content based upon its intended audience.

For example, if I were writing a topic map about topic maps, there would be a lot of terms and subjects which I would use, relying on the background of the audience to know they did not originate with me.

But when I move to the first instance of an idea being proposed, I should use more formal citation, because that enables the reader to track the development of a particular idea or strategy. It would be inappropriate to talk about tolog, for example, without crediting Lars Marius Garshol with its creation and clearly distinguishing any statements about tolog as being from particular sources.

All topic map followers already know those facts, but in formal writing you should help the reader track down the sources you relied upon.

A committee discussion of tolog is a completely different case: no one is going to footnote their comments, and hopefully, if you are participating in such a discussion, you are already aware of tolog’s origins.

On the Rand Paul “scandal,” I think the media reaction cheapens the notion of plagiarism.

A better response to Rand Paul (you pick the topic) would be:

[Senator Paul], what you’ve just said is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having listened to it. I award you no points, and may God have mercy on your soul. (Billy Madison)

A new slogan for CNN (original): CNN: Spreading Dumbness 24X7.

Crowdsourcing Multi-Label Classification for Taxonomy Creation

Monday, November 4th, 2013

Crowdsourcing Multi-Label Classification for Taxonomy Creation by Jonathan Bragg, Mausam and Daniel S. Weld.


Recent work has introduced CASCADE, an algorithm for creating a globally-consistent taxonomy by crowdsourcing microwork from many individuals, each of whom may see only a tiny fraction of the data (Chilton et al. 2013). While CASCADE needs only unskilled labor and produces taxonomies whose quality approaches that of human experts, it uses significantly more labor than experts. This paper presents DELUGE, an improved workflow that produces taxonomies with comparable quality using significantly less crowd labor. Specifically, our method for crowdsourcing multi-label classification optimizes CASCADE’s most costly step (categorization) using less than 10% of the labor required by the original approach. DELUGE’s savings come from the use of decision theory and machine learning, which allow it to pose microtasks that aim to maximize information gain.
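The information-gain idea behind that kind of microtask selection can be illustrated generically. This is textbook expected entropy reduction over category hypotheses, not the authors' actual algorithm; the priors and likelihoods below are made up:

```python
import math

def entropy(p):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_gain(prior, likelihoods):
    """Expected entropy reduction over category hypotheses from asking
    one yes/no labeling question.

    prior:        P(category c) for each c.
    likelihoods:  P(answer is "yes" | c) for each c.
    """
    p_yes = sum(p, l) if False else sum(p * l for p, l in zip(prior, likelihoods))
    p_no = 1 - p_yes
    post_yes = [p * l / p_yes for p, l in zip(prior, likelihoods)]
    post_no = [p * (1 - l) / p_no for p, l in zip(prior, likelihoods)]
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))

prior = [0.5, 0.5]
# A question whose answer splits the hypotheses is worth asking;
# one whose answer says nothing about them is not.
informative = expected_gain(prior, [0.9, 0.1])
useless = expected_gain(prior, [0.5, 0.5])
```

Posing the microtask with the highest expected gain first is how a workflow can spend less crowd labor for the same taxonomy quality.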

An extension of work reported at Cascade: Crowdsourcing Taxonomy Creation.

While the reduction in required work is interesting, the ability to sustain more complex workflows looks like the more important result.

That will require the development of workflows to be optimized, at least for subject identification.

Or should I say validation of subject identification?

What workflow do you use for subject identification and/or validation of subject identification?

New Community Forums for Cloudera Customers and Users

Monday, July 29th, 2013

New Community Forums for Cloudera Customers and Users by Justin Kestelyn.

From the post:

This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera’s Platform for Big Data! (Login using your existing credentials or click the link to register.)

Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.

The new service brings key additions to the conversation: functionality, search, structure and scalability. It is now considerably easier to ask questions, find answers (or questions to answer), follow and share threads, and create a visible and sustainable reputation in the community. And for Cloudera customers, there’s a bonus: your questions will be escalated as bonafide support cases under certain circumstances (see below).

Another way for you to participate in the Hadoop ecosystem!

BTW, from the FAQ on the discussion taxonomy:

What is the reasoning behind your taxonomy?

We made a sincere effort to balance the requirements of simplicity and thoroughness. Of course, we’re always open to suggestions for improvements.

I don’t doubt the sincerity of the taxonomy authors. Not one bit.

But all taxonomies represent the “intuitive” view of some small group. There is no means of escaping the narrow view of any single taxonomy.

What we can do, at least with topic maps, is to allow groups to have their own taxonomies and to view data through those taxonomies.

Mapping between taxonomies means that addition via any of the taxonomies results in new data appearing as appropriate in other taxonomies.
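A minimal sketch of such a mapping in Python (the taxonomies, terms and subject identifiers are hypothetical): both vocabularies key into shared subject identifiers, so a document added under one group's term immediately appears under the other group's term for the same subject.

```python
from collections import defaultdict

# Two groups tag the same subject with different terms; the mapping
# table keys both vocabularies to a shared subject identifier.
MAPPING = {("ops", "HDFS"): "subject:hdfs",
           ("dev", "Hadoop Distributed File System"): "subject:hdfs"}

store = defaultdict(list)  # subject id -> documents

def add(taxonomy, term, doc):
    store[MAPPING[(taxonomy, term)]].append(doc)

def browse(taxonomy, term):
    return store.get(MAPPING[(taxonomy, term)], [])

add("ops", "HDFS", "datanode-tuning.md")
# The same document is now visible under the dev team's term as well:
browse("dev", "Hadoop Distributed File System")  # ["datanode-tuning.md"]
```

Neither group has to adopt the other's vocabulary; the mapping does the work.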

Perhaps it was necessary to champion one taxonomy when information systems were fixed, printed representations of data and access systems.

But the need for a single taxonomy, if it ever existed, does not exist now. We are free to have any number of taxonomies for any data set, visible or invisible to other users/taxonomies.

More than thirty (30) years after the invention of the personal computer, we are still laboring under the traditions of printed information systems.

Isn’t it time to move on?

Graph-based Approach to Automatic Taxonomy Generation (GraBTax)

Tuesday, July 9th, 2013

Graph-based Approach to Automatic Taxonomy Generation (GraBTax) by Pucktada Treeratpituk, Madian Khabsa, C. Lee Giles.


We propose a novel graph-based approach for constructing concept hierarchy from a large text corpus. Our algorithm, GraBTax, incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, GraBTax first extracts topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply GraBTax to articles, primarily computer science, in the CiteSeerX digital library and search engine. The quality of the resulting concept hierarchy is assessed by both human judges and comparison with Wikipedia categories.
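The pipeline in the abstract (extract topical terms, build a weighted co-occurrence graph, recursively partition it) can be caricatured with the standard library. A toy sketch with made-up paper titles; thresholding plus connected components stands in for GraBTax's real, much smarter partitioning step:

```python
from collections import Counter
from itertools import combinations

titles = ["semantic web ontology", "ontology alignment",
          "neural network training", "deep neural network"]

# Weighted co-occurrence graph: edge weight = number of titles in
# which two terms appear together.
cooc = Counter()
for t in titles:
    for a, b in combinations(sorted(set(t.split())), 2):
        cooc[(a, b)] += 1

def components(edges, nodes):
    """Connected components of an undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

nodes = {w for t in titles for w in t.split()}
# "Partition" by keeping only edges at or above a weight threshold.
strong = [e for e, w in cooc.items() if w >= 1]
clusters = components(strong, nodes)
# Two topic clusters emerge: one around ontology/semantic web,
# one around neural networks.
```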

Interesting work.

For example:

Unfortunately, existing taxonomies for concepts in computer science such as ODP categories and the ACM Classification System are unsuitable as a gold standard. ODP categories are too broad and do not contain the majority of concepts produced by our algorithm. For instance, there are no sub-concepts for “Semantic Web” in ODP. Also some portions of ODP categories under computer science are not computer science related concepts, especially at the lower level. For example, the concepts under “Neural Networks” are Books, People, Companies, Publications, FAQs, Help and Tutorials, etc. The ACM Classification System has similar drawbacks, where its categories are too broad for comparison.

Makes me curious if comparing the topics extracted from articles would consistently map to the broad categories assigned by the ACM.

Also instructive is the use of graphs, which admit no pre-determined data structure.

I say that because of an on-going discussion about alternative data models for topic maps.

As you know, I don’t think topic maps have only one data model, not even my own.

The model you construct with your topic map should meet your needs, not mine.

Graphs are a good example of interchangeable information artifacts despite no one being able to constrain the graphs of others.

XML is another, although it gets overlooked from time to time.

PS: The authors don’t say but I am assuming that ODP = Open Directory Project.

Cascade: Crowdsourcing Taxonomy Creation

Tuesday, May 14th, 2013

Cascade: Crowdsourcing Taxonomy Creation by Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, James A. Landay.


Taxonomies are a useful and ubiquitous way of organizing information. However, creating organizational hierarchies is difficult because the process requires a global understanding of the objects to be categorized. Usually one is created by an individual or a small group of people working together for hours or even days. Unfortunately, this centralized approach does not work well for the large, quickly-changing datasets found on the web. Cascade is an automated workflow that creates a taxonomy from the collective efforts of crowd workers who spend as little as 20 seconds each. We evaluate Cascade and show that on three datasets its quality is 80-90% of that of experts. The cost of Cascade is competitive with expert information architects, despite taking six times more human labor. Fortunately, this labor can be parallelized such that Cascade will run in as fast as five minutes instead of hours or days.

In the introduction the authors say:

Crowdsourcing has become a popular way to solve problems that are too hard for today’s AI techniques, such as translation, linguistic tagging, and visual interpretation. Most successful crowdsourcing systems operate on problems that naturally break into small units of labor, e.g., labeling millions of independent photographs. However, taxonomy creation is much harder to decompose, because it requires a global perspective. Cascade is a unique, iterative workflow that emergently generates this global view from the distributed actions of hundreds of people working on small, local problems.

The authors demonstrate the potential for time and cost savings in the creation of taxonomies but I take the significance of their paper to be something different.

As the paper demonstrates, taxonomy creation does not require a global perspective.

Any one of the individuals who participated, contributed localized knowledge that when combined with other localized knowledge, can be formed into what an observer would call a taxonomy.
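That aggregation of local judgments can be caricatured in Python (the votes below are hypothetical; Cascade's actual workflow is considerably more sophisticated). Each worker answers one small "is X a kind of Y?" question, and a global hierarchy is assembled from the tallies:

```python
from collections import Counter, defaultdict

# Each tuple is one worker's answer to one (child, candidate parent)
# question; no single worker saw the whole dataset.
votes = [("tiger", "mammal", True), ("tiger", "mammal", True),
         ("tiger", "plant", False), ("oak", "plant", True),
         ("oak", "mammal", False), ("oak", "plant", True)]

def build_tree(votes):
    # Net support for each candidate parent of each child.
    tally = defaultdict(Counter)
    for child, parent, yes in votes:
        tally[child][parent] += 1 if yes else -1
    # Attach each child to its best-supported parent.
    return {child: max(scores, key=scores.get)
            for child, scores in tally.items()}

build_tree(votes)  # {"tiger": "mammal", "oak": "plant"}
```

The "taxonomy" exists only in the aggregate; no individual judgment required a global perspective, which is the point made above.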

A critical point, since every user represents/reflects slightly varying experiences and viewpoints, while the most learned expert represents only one.

Does “your” taxonomy reflect your views or some expert’s?

Taxonomy disaster stories

Saturday, May 4th, 2013

Taxonomy disaster stories

If you like Saturday afternoon disaster stories, Jean Graef relates hellish taxonomy stories from Patrick Lambe of Straits Knowledge.

No surprises but the stories are well told.

TaxoCoP · Taxonomy Community of Practice

Saturday, May 4th, 2013

TaxoCoP · Taxonomy Community of Practice

From the webpage:

“Taxonomies? That’s classified information…”

The taxonomy community of practice is a forum to communicate ideas, techniques and experiences in deriving, applying and maintaining taxonomies. Members include practitioners of various backgrounds and responsibilities: consultants, taxonomists, indexers, content managers, knowledge management professionals, librarians, and others.

Join us on the Taxonomy CoP wiki to contribute resources, read conference call and discussion summaries, and more!

Yahoo! discussion group on taxonomies.

I found this group following a link on the Taxonomy CoP wiki, which Jean Graef points to in Taxonomy disaster stories.

Taxonomies Make the Law. Will Folksonomies Change It?

Wednesday, May 1st, 2013

Taxonomies Make the Law. Will Folksonomies Change It? by Serena Manzoli.

From the post:

Take a look at your bundle of tags on Delicious. Would you ever believe you’re going to change the law with a handful of them?

You’re going to change the way you research the law. The way you apply it. The way you teach it and, in doing so, shape the minds of future lawyers.

Do you think I’m going too far? Maybe.

But don’t overlook the way taxonomies have changed the law and shaped lawyers’ minds so far. Taxonomies? Yeah, taxonomies.

We, the lawyers, have used extensively taxonomies through the years; Civil lawyers in particular have shown to be particularly prone to them. We’ve used taxonomies for three reasons: to help legal research, to help memorization and teaching, and to apply the law.

Serena omits one reason lawyers use taxonomies: Knowledge of a taxonomy, particularly a complex one, confers power.

Legal taxonomies also exclude the vast majority of the population from meaningful engagement in public debates, much less decision making.

To be fair, some areas of the law are very complex; securities and tax law come to mind. Even without the taxonomy barrier, mastery is a difficult thing.

Serena’s example of navigable waters reminded me of one of my law professors who, in separate cases, lost both sides of the question of navigability of a particular waterway. 😉

I am hopeful that Serena is correct about the coming impact of folksonomies on the law.

But I am also mindful that legal “reform” rarely emerges from the gauntlet of privilege unscathed.

I first saw this at: Manzoli on Legal Taxonomies and Legal Folksonomies.

SharePoint Taxonomy: How to Start

Tuesday, April 2nd, 2013

SharePoint Taxonomy: How to Start

From the post:

Are you wondering how to start with SharePoint Taxonomy?

Many people have heard about the value of managed metadata, term store, and tagging in SharePoint 2010 and SharePoint 2013 but don’t have a taxonomy and are wondering what a taxonomy looks like and how to get started.

Download a free SharePoint Taxonomy from WAND and begin to see how taxonomy, managed metadata, and the term store in SharePoint can improve searching and findability of your SharePoint content. This taxonomy is a starter set of terms covering Legal, IT, HR, Accounting and Finance, and Sales and Marketing

There are more taxonomies at:

I ran across this today while thinking about the question of design patterns.

The web is littered with taxonomies, ontologies, thesauri, etc., so rather than starting over from scratch, why not cut-n-paste/adapt/represent existing structures as topic maps?

Suggestions of other sources?

Particularly ones you are interested in seeing as topic maps!

What is the difference between a Taxonomy and an Ontology?

Monday, April 1st, 2013

What is the difference between a Taxonomy and an Ontology?

From the post:

In the world of information management, two common terms that people use are “taxonomy” and “ontology” but people often wonder what the difference between the two terms are. In many of our webinars, this question comes up so I wanted to provide an answer on our blog.

When I first read this post, I thought it was an April Fool’s post. But check the date: March 15, 2013. Unless April Fool’s day came early this year.

After reading the post you will find that what the author calls a taxonomy is actually an ontology.

Don’t take my word for it, see the original post.

I think the difference between a taxonomy and an ontology is that an ontology costs more.

I don’t know of any other universal differences between the two.

I first saw this in Taxonomy or Ontology by April Holmes.

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Tuesday, February 5th, 2013

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

She’s wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why are we missing asking users what they meant as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.

Any questions?

taxize: Taxonomic search and phylogeny retrieval [R]

Monday, December 17th, 2012

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.
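
taxize is an R package, but the core idea — walking a child-to-parent map up and down the hierarchy — fits in a few lines. A toy sketch in Python, with a hard-coded fragment standing in for the data the package fetches from services such as ITIS or EOL:

```python
# Toy taxonomy fragment: child -> parent. In taxize this data would
# come from a web service rather than a hard-coded dict.
PARENT = {
    "Helianthus annuus": "Helianthus",
    "Helianthus petiolaris": "Helianthus",
    "Helianthus": "Asteraceae",
    "Asteraceae": "Asterales",
}

def upstream(name):
    """Walk up the hierarchy from a name toward the root."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain

def downstream(name):
    """All names below a given name, recursively."""
    children = [c for c, p in PARENT.items() if p == name]
    return children + [g for c in children for g in downstream(c)]

print(upstream("Helianthus annuus"))  # ['Helianthus', 'Asteraceae', 'Asterales']
print(downstream("Asteraceae"))
```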

Just in case you need species names and/or taxonomic hierarchy information for your topic map.

…such as the eXtensible Business Reporting Language (XBRL).

Saturday, April 28th, 2012

Now there is a shout-out! Better than Stephen Colbert or Jon Stewart? Possibly, possibly. 😉

Where? The DATA act, recently passed by the House of Representatives (US), reads in part:

EXISTING DATA REPORTING STANDARDS.—In designating reporting standards under this subsection, the Commission shall, to the extent practicable, incorporate existing nonproprietary standards, such as the eXtensible Business Reporting Language (XBRL). [Title 31, Section 3611(b)(3). Doesn’t really roll off the tongue does it?]

No guarantees but what do you think the odds are that XBRL will be used by the commission? (That’s what I thought.)

With that in mind:


The XBRL homepage is apparently the starting point for all things XBRL. You will find the specifications, taxonomies, best practices and other materials on XBRL there.

Enough reading material to keep you busy while waiting for organizations to adopt or to be required to adopt XBRL.

Topic maps are relevant to this transition for several reasons, among others:

  1. Some organizations will have legacy accounting systems that require mapping to XBRL.
  2. Even organizations that have transitioned to XBRL will have legacy data that has not.
  3. Transitions to XBRL by different organizations may not reflect the same underlying semantics.
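
The first point is where most of the work hides. A minimal sketch of what such a mapping looks like (the account codes and concept names here are hypothetical, not taken from an actual XBRL taxonomy):

```python
# Hypothetical mapping from a legacy chart of accounts to XBRL-style
# concept names; the names are illustrative only.
LEGACY_TO_XBRL = {
    "4000-SALES": "us-gaap:Revenues",
    "5000-COGS": "us-gaap:CostOfGoodsSold",
}

def to_xbrl(legacy_entries):
    """Re-key legacy balances by XBRL concept, flagging unmapped
    accounts for review by hand."""
    mapped, unmapped = {}, []
    for code, amount in legacy_entries.items():
        concept = LEGACY_TO_XBRL.get(code)
        if concept is None:
            unmapped.append(code)
        else:
            mapped[concept] = mapped.get(concept, 0) + amount
    return mapped, unmapped

mapped, unmapped = to_xbrl({"4000-SALES": 1200, "9999-MISC": 50})
print(mapped)    # {'us-gaap:Revenues': 1200}
print(unmapped)  # ['9999-MISC']
```

The unmapped list is exactly where the third point bites: two organizations can map the same legacy account to different XBRL concepts, and a topic map can record both mappings as views of the same subject.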

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Thursday, February 23rd, 2012

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Pablo Pareja writes:

I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.

Even though it is normally defined for only two nodes given it can easily be extended for a set of nodes with an arbitrary size. This is a quite common scenario that can be found across multiple fields and taxonomy is one of them.

The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, – I couldn’t find any applicable implementation that was thought for more than just two nodes.
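
Pareja’s implementation is Java against the Bio4j graph database, but the set-of-nodes version of the algorithm is easy to sketch. A minimal Python version over a child → parent map (not the Bio4j code):

```python
def lca(parent, nodes):
    """Lowest common ancestor of an arbitrary set of nodes in a tree,
    given a child -> parent map: intersect each node's path to the
    root and keep the deepest shared node."""
    def path_to_root(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path

    paths = [path_to_root(n) for n in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    # the first shared node on any root-ward path is the deepest one
    return next(n for n in paths[0] if n in common)

# NCBI-style toy fragment: child -> parent
parent = {
    "Homo sapiens": "Homo", "Homo": "Hominidae",
    "Pan troglodytes": "Pan", "Pan": "Hominidae",
    "Hominidae": "Primates",
}
print(lca(parent, ["Homo sapiens", "Pan troglodytes", "Homo"]))  # Hominidae
```

Walking every path to the root costs time proportional to tree depth per node, which is fine for taxonomies but would call for the usual preprocessing tricks on deeper graphs.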

Important for its use with NCBI taxonomy nodes but another use case comes readily to mind.

What about overlapping markup?

Traditionally we represent markup elements as single nodes, despite their composition of start and end events for each “well-formed” element in the text stream.

But what if we represent start and end events as nodes in a graph, with relationships both to each other and to other nodes in the markup stream?

Can we then ask the question: which pair of start/end nodes is the ancestor of a given start or end node?

If they have the same ancestor then we have the uninteresting case of well-formed markup.

But what if they don’t have the same ancestor? What can the common ancestor method tell us about the structure of the markup?
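
As a toy illustration (positions and element names hypothetical): if we keep start/end events as character offsets, overlap — the case where no single pair of start/end nodes dominates both elements — is easy to detect:

```python
def overlaps(span_a, span_b):
    """True if two (start, end) character spans cross without one
    containing the other -- i.e. the elements cannot nest in a tree."""
    (a1, a2), (b1, b2) = span_a, span_b
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

# A <verse> over characters 0-40 and a <sentence> over 20-60:
print(overlaps((0, 40), (20, 60)))  # True  -> overlapping markup
print(overlaps((0, 40), (10, 20)))  # False -> properly nested
```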

Definitely a research topic.

Wordmap Taxonomy Connectors for SharePoint and Endeca (Logician/Ontologist not included or required)

Tuesday, January 31st, 2012

Wordmap Taxonomy Connectors for SharePoint and Endeca

From the post:

Wordmap SharePoint Taxonomy Connector

Integrate Wordmap taxonomies directly with Microsoft® SharePoint to classify documents as well as support SharePoint browsing and search capabilities. The SharePoint Taxonomy Connector allows you to overcome many of the difficulties of managing taxonomies inside the SharePoint environment by allowing you to use Wordmap to do all of the daily taxonomy management tasks.

Wordmap Endeca Taxonomy Connector

The Endeca® Information Access Platform thrives on robust, well-constructed and well-maintained taxonomies. Use Wordmap to do your taxonomy management and allow our Endeca Taxonomy Connector to push the taxonomy to Endeca as the foundation of the guided navigation experience. Wordmap’s Endeca Taxonomy Connector directly translates your taxonomies into the Endeca dimension.xsd format, pulled into Endeca on system start-up. The Wordmap Endeca Taxonomy Connector also allows you to leverage taxonomy in Endeca’s powerful auto-classification engine for improved content indexing.

Wordmap Taxonomy Connector Highlights:

  • No configuration needed for consuming systems,
  • Wordmap taxonomy data is sent in the preferred format of the search application for easy ingestion
  • Manage the taxonomy centrally and push out only relevant sections for indexing, navigation and search
  • Taxonomy is seamlessly integrated into the content lifecycle

Definitely a step in the right direction.

Points to consider as you plan your next topic map project:

  1. Require no configuration for consuming systems,
  2. Send in the preferred format of the search application for easy ingestion
  3. Taxonomy not managed by end users, automatically push out relevant sections for indexing, navigation and search
  4. Seamlessly integrate topic map into the content lifecycle

Interesting that “tagging” of the data is a requirement for use of this tool. I would have thought otherwise.

Any pointers to how often this has been chosen as a solution? The last entry on their news page is dated 2009, which may mean they don’t keep their website up to date or that they aren’t active enough to have any news.

They are owned by Earley and Associates, which does have an active website but I still didn’t see much news about Wordmap. Searching turned up some old materials but nothing really recent.

A Taxonomy of Ideas?

Saturday, January 14th, 2012

A Taxonomy of Ideas?

David McCandless writes:

Recently, when throwing ideas around with people, I’ve noticed something. There seems to be a hidden language we use when evaluating ideas.

Neat idea. Brilliant idea. Dumb idea. Bad idea. Strange idea. Cool idea.

There’s something going on here. Each one of these ideas is subtly different in character. Each adjective somehow conveys the quality of the concept in a way we instantly and unconsciously understand.

Good point. There is always a “hidden language” that will be understood by members of a social group. But that also means the “hidden language” and its implications will not be understood, at least not in the same way, by people in other groups.

That same “hidden language” shapes our choices of subjects out of a grab bag of subjects (a particular data set if not the world).

We can name some things that influence our choices, but they are always far from being all of them. That means no set of rules will always lead to the choices we would make: we are incapable of specifying the rules in the required degree of detail.

Modular Unified Tagging Ontology (MUTO)

Thursday, November 17th, 2011

Modular Unified Tagging Ontology (MUTO)

From the webpage:

The Modular Unified Tagging Ontology (MUTO) is an ontology for tagging and folksonomies. It is based on a thorough review of earlier tagging ontologies and unifies core concepts in one consistent schema. It supports different forms of tagging, such as common, semantic, group, private, and automatic tagging, and is easily extensible.

I thought the tagging axioms were worth repeating:

  • A tag has always exactly one label – otherwise it is not a tag.

    (Additional labels can be separately defined, e.g. via skos:Concept.)
  • Tags with the same label are not necessarily semantically identical.

    (Each tag has its own identity and property values.)
  • A tag can itself be a resource of tagging (tagging of tags).
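
The second axiom is the interesting one for topic maps. A toy model (plain Python, not MUTO’s actual RDF vocabulary) of tags that share a label but keep distinct identities:

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)

@dataclass
class Tag:
    # MUTO-style: exactly one label, but identity is per-tag, so
    # equal labels never imply equal meaning.
    label: str
    tagger: str
    id: int = field(default_factory=lambda: next(_ids))

jaguar_car = Tag("jaguar", tagger="alice")  # tagging a car review
jaguar_cat = Tag("jaguar", tagger="bob")    # tagging a wildlife photo

print(jaguar_car.label == jaguar_cat.label)  # True: same label
print(jaguar_car.id == jaguar_cat.id)        # False: distinct tags
```

What the schema leaves open is the next step: asserting that two distinct tags do identify the same subject — which is precisely where topic map merging would come in.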

From the properties defined, however, it isn’t clear how to determine when tags do have the same meaning, or how to communicate that understanding to others.

Ah, or would that be a tagging of a tagging?

That sounds like it leaves a lot of semantic detail on the cutting room floor, but it may be that viable semantic systems, oh, say natural languages, do exactly that. Something to think about, isn’t it?

Introducing fise, the Open Source RESTful Semantic Engine

Saturday, October 22nd, 2011

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS european project Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of a electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert there relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to there knowledge base entries (like a biography for a famous person);
  • detect yet unknown entities of the same afore mentioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based API such as Open Calais, Zemanta and Evri just to name a few. However to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.

TaxoBank: Access, deposit, save, share, and discuss taxonomy resources
Monday, October 17th, 2011

TaxoBank: Access, deposit, save, share, and discuss taxonomy resources

From the webpage:

Welcome to the TaxoBank Terminology Registry

The TaxoBank contains information about controlled vocabularies of all types and complexities. We invite you to both browse and contribute. Enjoy term lists for special purpose use, get ideas for building your own vocabulary, perhaps find one that can give you a quicker start.

The information collected about each vocabulary follows a study (TRSS) conducted by JISC, the Joint Information Systems Committee of the Higher and Further Education Funding Councils. All of the recommended fields included in the study’s final report are included; some of those the study identified as Optional are not. See more about the Terminology Registry Scoping Study (TRSS) at their site. In addition, input from other information experts was elicited in planning the site.

This is an interactive web site. To add information about a vocabulary, click on Create Content in the left navigation pane (you’ll need to register as a user first; we just need your name and email). There are only eight required fields, but your listing will be more useful if you complete all the applicable fields about your vocabulary.

Add a comment to almost any page – how you’ve used the vocabulary, what you’d add to it, how you’d use it if expanded to an ontology, etc. Comments are welcome on Event and Blog pages as well. Click on Add Comment, and enter your thoughts. Even anonymous visitors (not signed in) can add comments, but they’ll be reviewed by a site admin before they’re made visible.

You may also update the Events section of the site. Taxonomy, Knowledge Systems, Information Architecture or Management, Metadata are all appropriate event themes. Click on Create Content and then on Events to add a new one (you’ll need to be a registered user).

Contact us through the Contact page, with suggestions, corrections, or to discuss displaying your vocabulary on this site (particularly important if it was created on a college server and faces erasure at the end of the academic year), or if you have questions.

Thank you for visiting (and participating)!

The “Vocabulary spotlight” suggested “Thesaurus of BellyDancing” on my first visit.

To be honest, I had never thought about belly dancing having a thesaurus or even a standard vocabulary for its description.

For class: Browse the listing and pick out an entry for a subject area unfamiliar to you. Prepare a short, say less than 5 minute oral review of the entry. What did you like/dislike, find useful, less than useful, etc. Did any thing about the entry interest you in finding out more about the subject matter or its treatment?

A Visual Taxonomy Of Every Chocolate Candy, Ever

Sunday, October 16th, 2011

A Visual Taxonomy Of Every Chocolate Candy, Ever

Just a reminder that information science need not be bland or even tasteless.

Not to mention this is a very clever use of visual layout to convey fairly complex information.

I got here via Chocolate as a Teaching Tool, at TaxoDiary. How could a title like that fail to catch your attention? 😉

The same source has The Very, Very, Many Varieties of Beer. It only lists 300 but I thought Lars Marius might find it useful in plotting a course to taste every brew on the planet. Would make a nice prize for the 2012 Balisage Conference.

Query processing in distributed, taxonomy-based information sources

Tuesday, September 13th, 2011

Query processing in distributed, taxonomy-based information sources by Carlo Meghini, Yannis Tzitzikas, Veronica Coltella, and Anastasia Analyti.

From the abstract:
We address the problem of answering queries over a distributed information system, storing objects indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of these criteria allow to cover a wide spectrum of architectures, ranging from client-server to peer-to-peer. We evaluate the performance of the various architectures by simulation on a network with O(10^4) nodes, and derive final results. An extensive review of the relevant literature is finally included.
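
The core retrieval problem can be stated compactly: an object answers a query term if it is indexed under that term or under any term the taxonomy subsumes under it. A toy Python sketch of that basic notion (not the paper’s hypergraph algorithm):

```python
SUBSUMED_BY = {  # narrower term -> broader term
    "laptop": "computer",
    "desktop": "computer",
    "computer": "electronics",
}

INDEX = {  # object -> terms it is indexed under
    "o1": {"laptop"},
    "o2": {"desktop", "camera"},
    "o3": {"electronics"},
}

def broaden(term):
    """A term plus all of its broader terms (transitive closure)."""
    terms = {term}
    while term in SUBSUMED_BY:
        term = SUBSUMED_BY[term]
        terms.add(term)
    return terms

def answer(query_terms):
    """Objects matching every query term, directly or via subsumption."""
    return {
        obj for obj, terms in INDEX.items()
        if all(any(q in broaden(t) for t in terms) for q in query_terms)
    }

print(sorted(answer({"computer"})))            # ['o1', 'o2']
print(sorted(answer({"computer", "camera"})))  # ['o2']
```

The paper’s contribution is doing this efficiently when the index, the taxonomy, or both are spread over many nodes.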

Two quick comments:

While simulations are informative, I am curious how the five architectures would fare against actual taxonomies. I suspect the complexity at any particular level varies greatly from taxonomy to taxonomy, at least for taxonomies that record natural phenomena.

Second, I think there is a growing recognition that while some data can be successfully gathered to a single location for processing, an increasing amount of data may be partially accessible but cannot be transferred for privacy, security or other reasons. Such diverse systems are likely to have their own means of identifying subjects.