Archive for the ‘Taxonomy’ Category

Cascade: Crowdsourcing Taxonomy Creation

Tuesday, May 14th, 2013

Cascade: Crowdsourcing Taxonomy Creation by Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, James A. Landay.

Abstract:

Taxonomies are a useful and ubiquitous way of organizing information. However, creating organizational hierarchies is difficult because the process requires a global understanding of the objects to be categorized. Usually one is created by an individual or a small group of people working together for hours or even days. Unfortunately, this centralized approach does not work well for the large, quickly-changing datasets found on the web. Cascade is an automated workflow that creates a taxonomy from the collective efforts of crowd workers who spend as little as 20 seconds each. We evaluate Cascade and show that on three datasets its quality is 80-90% of that of experts. The cost of Cascade is competitive with expert information architects, despite taking six times more human labor. Fortunately, this labor can be parallelized such that Cascade will run in as fast as five minutes instead of hours or days.

In the introduction the authors say:

Crowdsourcing has become a popular way to solve problems that are too hard for today’s AI techniques, such as translation, linguistic tagging, and visual interpretation. Most successful crowdsourcing systems operate on problems that naturally break into small units of labor, e.g., labeling millions of independent photographs. However, taxonomy creation is much harder to decompose, because it requires a global perspective. Cascade is a unique, iterative workflow that emergently generates this global view from the distributed actions of hundreds of people working on small, local problems.

The authors demonstrate the potential for time and cost savings in the creation of taxonomies but I take the significance of their paper to be something different.

As the paper demonstrates, taxonomy creation does not require a global perspective.

Each individual who participated contributed localized knowledge that, when combined with other localized knowledge, can be formed into what an observer would call a taxonomy.

That is a critical point, since every user represents/reflects slightly varying experiences and viewpoints, while the most learned expert represents only one.

Does “your” taxonomy reflect your views or some expert’s?

Taxonomy disaster stories

Saturday, May 4th, 2013

Taxonomy disaster stories

If you like Saturday afternoon disaster stories, Jean Graef relates hellish taxonomy stories from Patrick Lambe of Straits Knowledge.

No surprises but the stories are well told.

TaxoCoP · Taxonomy Community of Practice

Saturday, May 4th, 2013

TaxoCoP · Taxonomy Community of Practice

From the webpage:

“Taxonomies? That’s classified information…”

The taxonomy community of practice is a forum to communicate ideas, techniques and experiences in deriving, applying and maintaining taxonomies. Members include practitioners of various backgrounds and responsibilities: consultants, taxonomists, indexers, content managers, knowledge management professionals, librarians, and others.

Join us on the Taxonomy CoP wiki to contribute resources, read conference call and discussion summaries, and more!

Yahoo! discussion group on taxonomies.

I found this group following a link on the Taxonomy CoP wiki, which Jean Graef points to in Taxonomy disaster stories.

Taxonomies Make the Law. Will Folksonomies Change It?

Wednesday, May 1st, 2013

Taxonomies Make the Law. Will Folksonomies Change It? by Serena Manzoli.

From the post:

Take a look at your bundle of tags on Delicious. Would you ever believe you’re going to change the law with a handful of them?

You’re going to change the way you research the law. The way you apply it. The way you teach it and, in doing so, shape the minds of future lawyers.

Do you think I’m going too far? Maybe.

But don’t overlook the way taxonomies have changed the law and shaped lawyers’ minds so far. Taxonomies? Yeah, taxonomies.

We, the lawyers, have used extensively taxonomies through the years; Civil lawyers in particular have shown to be particularly prone to them. We’ve used taxonomies for three reasons: to help legal research, to help memorization and teaching, and to apply the law.

Serena omits one reason lawyers use taxonomies: Knowledge of a taxonomy, particularly a complex one, confers power.

Legal taxonomies also exclude the vast majority of the population from meaningful engagement in public debates, much less decision making.

To be fair, some areas of the law are very complex; securities and tax law come to mind. Even without the taxonomy barrier, mastery is a difficult thing.

Serena’s example of navigable waters reminded me of one of my law professors who, in separate cases, lost both sides of the question of navigability of a particular waterway. ;-)

I am hopeful that Serena is correct about the coming impact of folksonomies on the law.

But I am also mindful that legal “reform” rarely emerges from the gauntlet of privilege unscathed.

I first saw this at: Manzoli on Legal Taxonomies and Legal Folksonomies.

SharePoint Taxonomy: How to Start

Tuesday, April 2nd, 2013

SharePoint Taxonomy: How to Start

From the post:

Are you wondering how to start with SharePoint Taxonomy?

Many people have heard about the value of managed metadata, term store, and tagging in SharePoint 2010 and SharePoint 2013 but don’t have a taxonomy and are wondering what a taxonomy looks like and how to get started.

Download a free SharePoint Taxonomy from WAND and begin to see how taxonomy, managed metadata, and the term store in SharePoint can improve searching and findability of your SharePoint content. This taxonomy is a starter set of terms covering Legal, IT, HR, Accounting and Finance, and Sales and Marketing

There are more taxonomies at: http://blog.wandinc.com/p/sharepoint-2010-2013-and-online.html.

I ran across this today while thinking about the question of design patterns.

The web is littered with taxonomies, ontologies, thesauri, etc., so rather than starting over from scratch, why not cut-n-paste/adapt/represent existing structures as topic maps?

Suggestions of other sources?

Particularly ones you are interested in seeing as topic maps!

What is the difference between a Taxonomy and an Ontology?

Monday, April 1st, 2013

What is the difference between a Taxonomy and an Ontology?

From the post:

In the world of information management, two common terms that people use are “taxonomy” and “ontology” but people often wonder what the difference between the two terms are. In many of our webinars, this question comes up so I wanted to provide an answer on our blog.

When I first read this post, I thought it was an April Fool’s post. But check the date: March 15, 2013. Unless April Fool’s day came early this year.

After reading the post you will find that what the author calls a taxonomy is actually an ontology.

Don’t take my word for it, see the original post.

I think the difference between a taxonomy and an ontology is that an ontology costs more.

I don’t know of any other universal differences between the two.

I first saw this in Taxonomy or Ontology by April Holmes.

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Tuesday, February 5th, 2013

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

She’s wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why are we missing asking users what they meant as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.

Any questions?

taxize: Taxonomic search and phylogeny retrieval [R]

Monday, December 17th, 2012

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.

Just in case you need species names and/or taxonomic hierarchy information for your topic map.

…such as the eXtensible Business Reporting Language (XBRL).

Saturday, April 28th, 2012

Now there is a shout-out! Better than Stephen Colbert or Jon Stewart? Possibly, possibly. ;-)

Where? The DATA Act, recently passed by the House of Representatives (US), reads in part:

EXISTING DATA REPORTING STANDARDS.—In designating reporting standards under this subsection, the Commission shall, to the extent practicable, incorporate existing nonproprietary standards, such as the eXtensible Business Reporting Language (XBRL). [Title 31, Section 3611(b)(3). Doesn't really roll off the tongue does it?]

No guarantees but what do you think the odds are that XBRL will be used by the commission? (That’s what I thought.)

With that in mind:

XBRL

Homepage for XBRL.org and apparently the starting point for all things XBRL. You will find the specifications, taxonomies, best practices and other materials on XBRL.

Enough reading material to keep you busy while waiting for organizations to adopt or to be required to adopt XBRL.

Topic maps are relevant to this transition for several reasons, among others:

  1. Some organizations will have legacy accounting systems that require mapping to XBRL.
  2. Even organizations that have transitioned to XBRL will have legacy data that has not.
  3. Transitions to XBRL by different organizations may not reflect the same underlying semantics.

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Thursday, February 23rd, 2012

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Pablo Pareja writes:

I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.

Even though it is normally defined for only two nodes given it can easily be extended for a set of nodes with an arbitrary size. This is a quite common scenario that can be found across multiple fields and taxonomy is one of them.

The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, – I couldn’t find any applicable implementation that was thought for more than just two nodes.
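
A minimal sketch of the set-based extension Pablo describes, in Python rather than against Bio4j. It assumes the taxonomy is a tree available as a child-to-parent dictionary; the function names and the toy tree are hypothetical illustrations, not Bio4j's API or real NCBI data. The idea is to write out each node's path from the root and walk down while all paths agree.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (node first)."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Deepest node that is an ancestor of (or equal to) every node in `nodes`.

    Returns None if the nodes share no ancestor at all.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("need at least one node")
    # Reverse each path so that all of them start at the root.
    paths = [path_to_root(n, parent)[::-1] for n in nodes]
    lca = None
    # Walk down from the root while every path agrees on the next node.
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# Hypothetical toy tree (child -> parent); not real NCBI taxonomy data.
PARENT = {
    "root": None,
    "A": "root",
    "B": "root",
    "A1": "A",
    "A2": "A",
    "A1x": "A1",
    "A1y": "A1",
}

print(lowest_common_ancestor(["A1x", "A1y"], PARENT))        # A1
print(lowest_common_ancestor(["A1x", "A1y", "A2"], PARENT))  # A
```

This version costs one root path per node, which is fine for a sketch; a production implementation over a large graph would likely want something smarter.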

Important for its use with NCBI taxonomy nodes but another use case comes readily to mind.

What about overlapping markup?

Traditionally we represent markup elements as single nodes, despite their composition of start and end events for each “well-formed” element in the text stream.

But what if we represent start and end events as nodes in a graph with relationships both to each other and other nodes in the markup stream?

Can we then ask the question: which pair of start/end nodes is the ancestor of either a start or an end element?

If they have the same ancestor then we have the uninteresting case of well-formed markup.

But what if they don’t have the same ancestor? What can the common ancestor method tell us about the structure of the markup?

Definitely a research topic.

Wordmap Taxonomy Connectors for SharePoint and Endeca (Logician/Ontologist not included or required)

Tuesday, January 31st, 2012

Wordmap Taxonomy Connectors for SharePoint and Endeca

From the post:

Wordmap SharePoint Taxonomy Connector

Integrate Wordmap taxonomies directly with Microsoft® SharePoint to classify documents as well as support SharePoint browsing and search capabilities. The SharePoint Taxonomy Connector allows you to overcome many of the difficulties of managing taxonomies inside the SharePoint environment by allowing you to use Wordmap to do all of the daily taxonomy management tasks.

Wordmap Endeca Taxonomy Connector

The Endeca® Information Access Platform thrives on robust, well-constructed and well-maintained taxonomies. Use Wordmap to do your taxonomy management and allow our Endeca Taxonomy Connector to push the taxonomy to Endeca as the foundation of the guided navigation experience. Wordmap’s Endeca Taxonomy Connector directly translates your taxonomies into the Endeca dimension.xsd format, pulled into Endeca on system start-up. The Wordmap Endeca Taxonomy Connector also allows you to leverage taxonomy in Endeca’s powerful auto-classification engine for improved content indexing.

Wordmap Taxonomy Connector Highlights:

  • No configuration needed for consuming systems,
  • Wordmap taxonomy data is sent in the preferred format of the search application for easy ingestion
  • Manage the taxonomy centrally and push out only relevant sections for indexing, navigation and search
  • Taxonomy is seamlessly integrated into the content lifecycle

Definitely a step in the right direction.

Points to consider as you plan your next topic map project:

  1. Require no configuration for consuming systems
  2. Send in the preferred format of the search application for easy ingestion
  3. Taxonomy not managed by end users, automatically push out relevant sections for indexing, navigation and search
  4. Seamlessly integrate topic map into the content lifecycle

Interesting that “tagging” of the data is a requirement for use of this tool. I would have thought otherwise.

Any pointers to how often this has been chosen as a solution? The last entry on their news page is dated in 2009. Which may mean they don’t keep up their website very well or that they aren’t active enough to have any news.

They are owned by Earley and Associates, which does have an active website but I still didn’t see much news about Wordmap. Searching turned up some old materials but nothing really recent.

A Taxonomy of Ideas?

Saturday, January 14th, 2012

A Taxonomy of Ideas?

David McCandless writes:

Recently, when throwing ideas around with people, I’ve noticed something. There seems to be a hidden language we use when evaluating ideas.

Neat idea. Brilliant idea. Dumb idea. Bad idea. Strange idea. Cool idea.

There’s something going on here. Each one of these ideas is subtly different in character. Each adjective somehow conveys the quality of the concept in a way we instantly and unconsciously understand.

Good point. There is always a “hidden language” that will be understood by members of a social group. But that also means that the “hidden language” and its implications will not be understood, at least not in the same way, by people in other groups.

That same “hidden language” shapes our choices of subjects out of a grab bag of subjects (a particular data set if not the world).

We can name some things that influence our choices, but it is always far from being all of them. Which means that no set of rules will always lead to the choices we would make. We are incapable of specifying the rules in the required degree of detail.

Modular Unified Tagging Ontology (MUTO)

Thursday, November 17th, 2011

Modular Unified Tagging Ontology (MUTO)

From the webpage:

The Modular Unified Tagging Ontology (MUTO) is an ontology for tagging and folksonomies. It is based on a thorough review of earlier tagging ontologies and unifies core concepts in one consistent schema. It supports different forms of tagging, such as common, semantic, group, private, and automatic tagging, and is easily extensible.

I thought the tagging axioms were worth repeating:

  • A tag has always exactly one label – otherwise it is not a tag.

    (Additional labels can be separately defined, e.g. via skos:Concept.)
  • Tags with the same label are not necessarily semantically identical.

    (Each tag has its own identity and property values.)
  • A tag can itself be a resource of tagging (tagging of tags).
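
To make the three axioms concrete, here is a rough Python sketch of them as a data structure. It is only an illustration: MUTO itself is an ontology, and the `Tag` class and `tag_with` method below are hypothetical names of my own, not part of the MUTO vocabulary.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: tags compare by identity, not by label
class Tag:
    label: str                                # axiom 1: exactly one label
    tags: list = field(default_factory=list)  # axiom 3: tags applied to this tag

    def __post_init__(self):
        if not isinstance(self.label, str) or not self.label:
            raise ValueError("a tag needs exactly one non-empty label")

    def tag_with(self, other):
        """Attach another tag to this tag (tagging of tags)."""
        self.tags.append(other)
        return other


# Axiom 2: same label, but two distinct tags with their own identity.
cat = Tag("jaguar")
car = Tag("jaguar")
cat.tag_with(Tag("animal"))

print(cat.label == car.label)  # True
print(cat == car)              # False
```

Identity-based equality is the design choice that matters here: nothing in the structure says that two tags labeled "jaguar" mean the same thing.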

From the properties defined, however, it isn’t clear how to determine when tags do have the same meaning and/or how to communicate that understanding to others.

Ah, or would that be a tagging of a tagging?

That sounds like it leaves a lot of semantic detail on the cutting room floor but it may be that viable semantic systems, oh, say natural languages, do exactly that. Something to think about, isn’t it?

Introducing fise, the Open Source RESTful Semantic Engine

Saturday, October 22nd, 2011

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS european project Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of a electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert there relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to there knowledge base entries (like a biography for a famous person);
  • detect yet unknown entities of the same afore mentioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based API such as Open Calais, Zemanta and Evri just to name a few. However to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.

TaxoBank

Monday, October 17th, 2011

TaxoBank: Access, deposit, save, share, and discuss taxonomy resources

From the webpage:

Welcome to the TaxoBank Terminology Registry

The TaxoBank contains information about controlled vocabularies of all types and complexities. We invite you to both browse and contribute. Enjoy term lists for special purpose use, get ideas for building your own vocabulary, perhaps find one that can give you a quicker start.

The information collected about each vocabulary follows a study (TRSS) conducted by JISC, the Joint Information Systems Committee of the Higher and Further Education Funding Councils. All of the recommended fields included in the study’s final report are included; some of those the study identified as Optional are not. See more about the Terminology Registry Scoping Study (TRSS) at their site. In addition, input from other information experts was elicited in planning the site.

This is an interactive web site. To add information about a vocabulary, click on Create Content in the left navigation pane (you’ll need to register as a user first; we just need your name and email). There are only eight required fields, but your listing will be more useful if you complete all the applicable fields about your vocabulary.

Add a comment to almost any page – how you’ve used the vocabulary, what you’d add to it, how you’d use it if expanded to an ontology, etc. Comments are welcome on Event and Blog pages as well. Click on Add Comment, and enter your thoughts. Even anonymous visitors (not signed in) can add comments, but they’ll be reviewed by a site admin before they’re made visible.

You may also update the Events section of the site. Taxonomy, Knowledge Systems, Information Architecture or Management, Metadata are all appropriate event themes. Click on Create Content and then on Events to add a new one (you’ll need to be a registered user).

Contact us through the Contact page, with suggestions, corrections, or to discuss displaying your vocabulary on this site (particularly important if it was created on a college server and faces erasure at the end of the academic year), or if you have questions.

Thank you for visiting (and participating)!

The “Vocabulary spotlight” suggested “Thesaurus of BellyDancing” on my first visit.

To be honest, I had never thought about belly dancing having a thesaurus or even a standard vocabulary for its description.

For class: Browse the listing and pick out an entry for a subject area unfamiliar to you. Prepare a short (say, less than five minutes) oral review of the entry. What did you like/dislike, find useful, less than useful, etc.? Did anything about the entry interest you in finding out more about the subject matter or its treatment?

A Visual Taxonomy Of Every Chocolate Candy, Ever

Sunday, October 16th, 2011

A Visual Taxonomy Of Every Chocolate Candy, Ever

Just a reminder that information science need not be bland or even tasteless.

Not to mention this is a very clever use of visual layout to convey fairly complex information.

I got here via Chocolate as a Teaching Tool, at TaxoDiary. How could a title like that fail to catch your attention? ;-)

The same source has The Very, Very, Many Varieties of Beer. It only lists 300 but I thought Lars Marius might find it useful in plotting a course to taste every brew on the planet. Would make a nice prize for the 2012 Balisage Conference.

Query processing in distributed, taxonomy-based information sources

Tuesday, September 13th, 2011

Query processing in distributed, taxonomy-based information sources by Carlo Meghini, Yannis Tzitzikas, Veronica Coltella, and Anastasia Analyti.

Abstract:

We address the problem of answering queries over a distributed information system, storing objects indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of these criteria allow to cover a wide spectrum of architectures, ranging from client-server to peer-to-peer. We evaluate the performance of the various architectures by simulation on a network with O(10^4) nodes, and derive final results. An extensive review of the relevant literature is finally included.
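
For readers who want a feel for the problem, here is a deliberately simplified sketch of answering a query over objects indexed by taxonomy terms. It is not the paper's algorithm: the paper handles subsumption between DNF formulas and conjunctions with a hypergraph-based method, while this sketch assumes plain term-to-term subsumption on a single machine. The function names and the toy data are hypothetical.

```python
def narrower_terms(term, children):
    """Return `term` plus every term it subsumes, directly or indirectly."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def answer(term, children, index):
    """Objects indexed under `term` or under any narrower term."""
    result = set()
    for t in narrower_terms(term, children):
        result |= index.get(t, set())
    return result


# Hypothetical toy taxonomy: broader term -> directly narrower terms.
CHILDREN = {
    "animal": ["mammal", "bird"],
    "mammal": ["whale", "cat"],
}

# Hypothetical index: term -> objects stored under exactly that term.
INDEX = {
    "whale": {"doc1"},
    "cat": {"doc2"},
    "bird": {"doc3"},
    "animal": {"doc4"},
}

print(sorted(answer("mammal", CHILDREN, INDEX)))  # ['doc1', 'doc2']
```

In the distributed case the abstract describes, the `children` and `index` mappings would be spread across peers, which is where the choice between direct evaluation and query rewriting comes in.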

Two quick comments:

First, while simulations are informative, I am curious how the five architectures would fare against actual taxonomies. I am thinking that the complexity at any particular level varies greatly from taxonomy to taxonomy, assuming they are taxonomies that record natural phenomena.

Second, I think there is a growing recognition that while some data can be successfully gathered to a single location for processing, there is an increasing amount of data that may be partially accessible but that cannot be transferred for privacy, security or other concerns. And such diverse systems are likely to have their own means of identifying subjects.

The International Foundation for Information Technology (IF4IT)

Wednesday, August 24th, 2011

The International Foundation for Information Technology (IF4IT)

The Foundation has released:

A Glossary Taxonomy that provides a hierarchy of Glossaries, Terms and Definitions that are semantically grouped by relevant domain area.

A File Plan Taxonomy that specifically correlates with the previously published Records Management Taxonomy and the Records Taxonomy.

A Service Taxonomy that covers the majority of all enterprise and IT services.

A Software Taxonomy that itemizes the many different categories of enterprise and IT software.

It may be easier than coming up with your own taxonomy.

The twenty-four (24), yes, twenty-four, social media options (including email) on every page reminded me that one “killer” semantic web/topic map app would be to create a common interface to all of those. It would need to include set intersection for the contacts on the various services and manage the identity of contacts across the services.
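
A rough sketch of the set intersection part, in Python. It makes one large assumption, labeled here and in the code: that a normalized email address is enough to identify the same contact across services, which in practice it often is not. The service names, addresses and function names are all hypothetical.

```python
def normalize(address):
    """ASSUMPTION: a trimmed, lower-cased email address identifies a contact."""
    return address.strip().lower()


def shared_contacts(services):
    """Contacts that appear on every service (set intersection)."""
    sets = [{normalize(a) for a in addresses} for addresses in services.values()]
    return set.intersection(*sets) if sets else set()


def services_by_contact(services):
    """Map each contact to the set of services where it appears."""
    merged = {}
    for name, addresses in services.items():
        for address in addresses:
            merged.setdefault(normalize(address), set()).add(name)
    return merged


# Hypothetical contact lists exported from three services.
SERVICES = {
    "email": ["Ann@example.com", "bob@example.com"],
    "service_a": ["ann@example.com", "carol@example.com"],
    "service_b": ["ann@example.com ", "bob@example.com"],
}

print(shared_contacts(SERVICES))  # {'ann@example.com'}
```

The harder half, deciding that two differently identified contacts are the same person, is exactly the kind of identity question this sketch sidesteps.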