Archive for the ‘Thesaurus’ Category

What’s New for 2016 MeSH

Thursday, December 17th, 2015

What’s New for 2016 MeSH by Jacque-Lynne Schulman.

From the post:

MeSH is the National Library of Medicine controlled vocabulary thesaurus which is updated annually. NLM uses the MeSH thesaurus to index articles from thousands of biomedical journals for the MEDLINE/PubMed database and for the cataloging of books, documents, and audiovisuals acquired by the Library.

MeSH experts/users will need to absorb the details but some of the changes include:

Overview of Vocabulary Development and Changes for 2016 MeSH

  • 438 Descriptors added
  • 17 Descriptor terms replaced with more up-to-date terminology
  • 9 Descriptors deleted
  • 1 Qualifier (Subheading) deleted


MeSH Tree Changes: Uncle vs. Nephew Project

In the past, MeSH headings were loosely organized in trees and could appear in multiple locations depending upon the importance and specificity. In some cases the heading would appear two or more times in the same tree at higher and lower levels. This arrangement led to some headings appearing as a sibling (uncle) next to the heading under which they were treed as a nephew. In other cases a heading was included at a top level so it could be seen more readily in printed material. We reviewed these headings in MeSH and removed either the Uncle or Nephew depending upon the judgement of our Internal and External reviewers. There were over 1,000 tree changes resulting from this work, many of which will affect search retrieval in MEDLINE/PubMed and the NLM Catalog.
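To picture the uncle/nephew pattern, here is a minimal sketch. It assumes each heading carries a list of dot-separated tree numbers; the headings and tree numbers below are made up, and the test itself is my reading of the description, not NLM's actual procedure. A heading is flagged when one of its positions is a sibling of the parent of another of its positions.

```python
from itertools import permutations


def parent(tree_number):
    """Return the parent tree number, or None at the top of a tree."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def uncle_nephew_pairs(tree_numbers_by_heading):
    """Find headings treed both beside and beneath the same sibling group.

    Returns (heading, uncle_position, nephew_position) tuples where the
    uncle position is a sibling of the nephew position's parent.
    """
    found = []
    for heading, positions in tree_numbers_by_heading.items():
        for uncle, nephew in permutations(positions, 2):
            nephew_parent = parent(nephew)
            grandparent = parent(nephew_parent) if nephew_parent else None
            # Both positions must hang off the same (existing) ancestor.
            if grandparent is not None and parent(uncle) == grandparent:
                found.append((heading, uncle, nephew))
    return found


# Hypothetical data: "Heading X" sits at A01.100 and again under A01.200.
sample = {
    "Heading X": ["A01.100", "A01.200.300"],
    "Heading Y": ["A01.200"],
    "Heading Z": ["B02.500", "C03.100.400"],
}

print(uncle_nephew_pairs(sample))
```

Only "Heading X" is reported, because its two positions share the ancestor A01. Which of the two positions to drop is the judgement call the post says was left to reviewers.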


MeSH Scope Notes

MeSH had a policy that each descriptor should have a scope note regardless of how obvious its meaning. There were many legacy headings that were created without scope notes before this rule came into effect. This year we initiated a project to write scope notes for all existing headings. Thus far 481 scope notes to MeSH were added and the project continues for 2017 MeSH.

Echoes of Heraclitus:

It is not possible to step twice into the same river according to Heraclitus, or to come into contact twice with a mortal being in the same state. (Plutarch) (Heraclitus)

Semantics and the words we use to invoke them are always in a state of flux. Sometimes more, sometimes less.

The lesson here is that anyone who says you can have a fixed and stable vocabulary is not only selling something, they are selling you a broken something. If not broken on the day you start to use it, then fairly soon thereafter.

It took time for me to come to the realization that the same is true about information systems that attempt to capture changing semantics at any given point.

Topic maps in the sense of ISO 13250-2, for example, can capture and map changing semantics, but if and only if you are willing to accept its data model.

Which is good as far as it goes, but what if I want a different data model? That is, I still want to capture changing semantics and map between them, but using a different data model.

We may have a use case to map back to ISO 13250-2 or to some other data model. The point being that we should not privilege any data model or syntax in advance, at least not absolutely.

Not only do communities change but their preferences for technologies change as well. It seems just a bit odd to be selling an approach on the basis of capturing change only to build a dike to prevent change in your implementation.


Getty Thesaurus of Geographic Names (TGN)

Friday, August 22nd, 2014

Getty Thesaurus of Geographic Names Released as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that the Getty Research Institute has released the Getty Thesaurus of Geographic Names (TGN)® as Linked Open Data. This represents an important step in the Getty’s ongoing work to make our knowledge resources freely available to all.

Following the release of the Art & Architecture Thesaurus (AAT)® in February, TGN is now the second of the four Getty vocabularies to be made entirely free to download, share, and modify. Both data sets are available for download under an Open Data Commons Attribution License (ODC BY 1.0).

What Is TGN?

The Getty Thesaurus of Geographic Names is a resource of over 2,000,000 names of current and historical places, including cities, archaeological sites, nations, and physical features. It focuses mainly on places relevant to art, architecture, archaeology, art conservation, and related fields.

TGN is powerful for humanities research because of its linkages to the three other Getty vocabularies—the Union List of Artist Names, the Art & Architecture Thesaurus, and the Cultural Objects Name Authority. Together the vocabularies provide a suite of research resources covering a vast range of places, makers, objects, and artistic concepts. The work of three decades, the Getty vocabularies are living resources that continue to grow and improve.

Because they serve as standard references for cataloguing, the Getty vocabularies are also the conduits through which data published by museums, archives, libraries, and other cultural institutions can find and connect to each other.

A resource where you could lose some serious time!

Try this entry for London.

Or Paris.

Bear in mind the data that underlies this rich display is now available for free downloading.

CAB Thesaurus 2014

Wednesday, August 6th, 2014

CAB Thesaurus 2014

From the webpage:

The CAB Thesaurus is the essential search tool for all users of the CAB ABSTRACTS™ and Global Health databases and related products. The CAB Thesaurus is not only an invaluable aid for database users but it has many potential uses by individuals and organizations indexing their own information resources for both internal use and on the Internet.

Its strengths include:

  • Controlled vocabulary that has been in constant use since 1983
  • Regularly updated (current version released July 2014)
  • Broad coverage of pure and applied life sciences, technology and social sciences
  • Approximately 264,500 terms, including 144,900 preferred terms and 119,600 non-preferred terms
  • Specific terminology for all subjects covered
  • Includes about 206,400 plant, animal and microorganism names
  • Broad, narrow and related terms to help users find relevant terminology
  • Cross-references from non-preferred synonyms to preferred terms
  • Multi-lingual, with Dutch, Portuguese and Spanish equivalents for most English terms, plus lesser content in Danish, Finnish, French, German, Italian, Norwegian and Swedish
  • American and British spelling variants
  • Relevant CAS registry numbers for chemicals
  • Commission notation for enzymes

Impressive work and one that you should consult before venturing out to make a “standard” vocabulary for some area. It may already exist.

As a traditional thesaurus, CAB lists equivalent terms in other languages. That is to say, it omits any properties of its primary or “matching” terms that would enable readers to judge for themselves whether the terms represent the same subject.

When you become accustomed to thinking of what criteria were used to say two or more words represent the same subject, the lack of that information becomes glaring.
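To make the gap concrete, here is a minimal sketch of what recording the basis for an equivalence might look like. The class names, field names and sample properties are all hypothetical; nothing here reflects how CAB actually stores its terms.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    language: str
    # Properties a reader could inspect to judge what subject is meant.
    properties: dict = field(default_factory=dict)


@dataclass
class Equivalence:
    source: Term
    target: Term
    # Names of the properties the mapper says it relied on (hypothetical).
    basis: tuple = ()

    def shared_basis(self):
        """Return the claimed basis properties whose values actually agree."""
        return {
            name: self.source.properties[name]
            for name in self.basis
            if name in self.source.properties
            and name in self.target.properties
            and self.source.properties[name] == self.target.properties[name]
        }


maize = Term("maize", "en", {"scientific_name": "Zea mays", "rank": "species"})
maiz = Term("maíz", "es", {"scientific_name": "Zea mays"})
link = Equivalence(maize, maiz, basis=("scientific_name", "rank"))

print(link.shared_basis())
```

The point of the design is that a reader can see which properties were claimed as the basis and which of them the two terms really share, rather than taking the equivalence on faith.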

I first saw this at New edition of CAB Thesaurus published by Anton Doroszenko.

Getty Art & Architecture Thesaurus Now Available

Saturday, February 22nd, 2014

Art & Architecture Thesaurus Now Available as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that today, the Getty has released the Art & Architecture Thesaurus (AAT)® as Linked Open Data. The data set is available for download under an Open Data Commons Attribution License (ODC BY 1.0).

The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques. It’s one of the Getty Research Institute’s four Getty Vocabularies, a collection of databases that serves as the premier resource for cultural heritage terms, artists’ names, and geographical information, reflecting over 30 years of collaborative scholarship. The other three Getty Vocabularies will be released as Linked Open Data over the coming 18 months.

In recent months the Getty has launched the Open Content Program, which makes thousands of images of works of art available for download, and the Virtual Library, offering free online access to hundreds of Getty Publications backlist titles. Today’s release, another collaborative project between our scholars and technologists, is the next step in our goal to make our art and research resources as accessible as possible.

What’s Next

Over the next 18 months, the Research Institute’s other three Getty Vocabularies—The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®—will all become available as Linked Open Data. To follow the progress of the Linked Open Data project at the Research Institute, see their page here.

A couple of points of particular interest:

Getty documentation says this is the first industrial application of ISO 25964 Information and documentation – Thesauri and interoperability with other vocabularies.

You will probably want to read AAT Semantic Representation rather carefully.

A great source of data and interesting reading on the infrastructure as well.

I first saw this in a tweet by Semantic Web Company.

SKOSsy – Thesauri on the fly!

Tuesday, January 14th, 2014

SKOSsy – Thesauri on the fly!

From the webpage:

SKOSsy extracts data from LOD sources like DBpedia (and basically from any RDF based knowledge base you like) and works well for automatic text mining and whenever a seed thesaurus should be generated for a certain domain, organisation or a project.

If automatically generated thesauri are loaded into an editor like PoolParty Thesaurus Manager (PPT) you can start to enrich the knowledge model by additional concepts, relations and links to other LOD sources. With SKOSsy, thesaurus projects you don´t have to be started in the open countryside anymore. See also how SKOSsy is integrated into PPT.

  • SKOSsy makes heavy use of Linked Data sources, especially DBpedia
  • SKOSsy can generate SKOS thesauri for virtually any domain within a few minutes
  • Such thesauri can be improved, curated and extended to one´s individual needs but they serve usually as “good-enough” knowledge models for any semantic search application you like
  • SKOSsy thesauri serve as a basis for domain specific text extraction and knowledge enrichment
  • SKOSsy based semantic search usually outperform search algorithms based on pure statistics since they contain high-quality information about relations, labels and disambiguation
  • SKOSsy works perfectly together with PoolParty product family

DBpedia is probably closer to some users’ vocabulary than most formal ones. 😉

I have the sense that rather than asking experts for their semantics (and how to represent them), we are about to turn to users to ask about their semantics (and choose simple ways to represent them).

If results that are useful to the average user are the goal, it is a move in the right direction.

Experimenting with visualisation tools

Thursday, November 21st, 2013

Experimenting with visualisation tools by Brian Aitken.

From the post:

Over the past few months I’ve been working to develop some interactive visualisations that will eventually be made available on the Mapping Metaphor website. The project team investigated a variety of visualisation approaches that they considered well suited to both the project data and the connections between the data, and they also identified a number of toolkits that could be used to generate such visualisations.

Brian experiments with the JavaScript InfoVis Toolkit for the Mapping Metaphor with the Historical Thesaurus project.

Interesting read. Promises to cover D3 in a future post.

Could be very useful for other graph or topic map visualizations.

BARTOC launched : A register for vocabularies

Friday, November 15th, 2013

BARTOC launched : A register for vocabularies by Sarah Dister.

From the post:

Looking for a classification system, controlled vocabulary, ontology, taxonomy, thesaurus that covers the field you are working in? The University Library of Basel in Switzerland recently launched a register containing the metadata of 600 controlled and structured vocabularies in 65 languages. Its official name: the Basel Register of Thesauri, Ontologies and Classifications (BARTOC).

High quality search

All items in BARTOC are indexed with Eurovoc, EU’s multilingual thesaurus, and classified using Dewey Decimal Classification (DDC) numbers down to the third level, allowing a high quality subject search. Other search characteristics are:

  • The search interface is available in 20 languages.
  • A Boolean operators field is integrated into the search box.
  • The advanced search allows you to refine your search by Field type, Language, DDC, Format and Access.
  • In the results page you can refine your search further by using the facets on the right side.

A great step towards bridging vocabularies but at a much higher (more general) level than any enterprise or government department.

The Historical Thesaurus of English

Thursday, November 14th, 2013

The Historical Thesaurus of English

From the webpage:

The Historical Thesaurus of English project was initiated by the late Professor Michael Samuels in 1965 and completed in 2008. It contains almost 800,000 word meanings from Old English onwards, arranged in detailed hierarchies within broad conceptual categories such as Thought or Music. It is based on the second edition of the Oxford English Dictionary and its Supplements, with additional materials from A Thesaurus of Old English, and was published in print as the Historical Thesaurus of the OED by Oxford University Press on 22 October 2009.

This electronic version enables users to pinpoint the range of meanings of a word throughout its history, their synonyms, and their relationship to words of more general or more specific meaning. In addition to providing hitherto unavailable information for linguistic and textual scholars, the Historical Thesaurus online is a rich resource for students of social and cultural history, showing how concepts developed through the words that refer to them. Links to Oxford English Dictionary headwords are provided for subscribers to the online OED, which also links the two projects on its own site.

Take particular note of:

This electronic version enables users to pinpoint the range of meanings of a word throughout its history, their synonyms, and their relationship to words of more general or more specific meaning.

Ooooh, that means that words don’t have fixed meanings. Or that not everyone reads them the same way.

Want to improve your enterprise search results? A maintained domain/enterprise specific thesaurus would be a step in that direction.

Not to mention a thesaurus could reduce the 42% of people who use the wrong information to make decisions to a lesser number. (Findability As Value Proposition)

Unless you are happy with the 60/40 Rule, where 40% of your executives are making decisions based on incorrect information.

I wouldn’t be.
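As a rough illustration of how a maintained thesaurus can help enterprise search, here is a minimal query-expansion sketch. The thesaurus entries and the `expand_query` helper are hypothetical, and a real search engine would weight or rank the added terms rather than simply appending them.

```python
def expand_query(query, thesaurus):
    """Return the query words plus thesaurus equivalents, in order, no repeats."""
    expanded = []
    for word in query.lower().split():
        for candidate in [word, *thesaurus.get(word, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical enterprise thesaurus: each entry lists terms treated as equivalent.
thesaurus = {
    "invoice": ["bill", "statement"],
    "client": ["customer", "account"],
}

print(expand_query("Client invoice", thesaurus))
```

A search for "client invoice" would then also match documents that only say "customer bill". The value of the expansion depends entirely on someone keeping the thesaurus current.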

Mapping Metaphor with the Historical Thesaurus

Monday, June 24th, 2013

Mapping Metaphor with the Historical Thesaurus: Visualization of Links

From the post:

By the end of the Mapping Metaphor with the Historical Thesaurus project we will have a web resource which allows the user to find pathways into our data. It will show a map of the conceptual metaphors of English over the last thousand years, showing links between each semantic area where we find evidence of metaphorical overlap. Unsurprisingly, given the visual and spatial metaphors which we are necessarily already using to describe our data and the analysis of it (e.g pathways and maps), this will be represented graphically as well as in more traditional forms.

Below is a very early (in the project) example of a visualisation of the semantic domains of ‘Light’ and ‘Darkness, absence of light’, showing their metaphorical links with other semantic areas in the Historical Thesaurus data. We produced this using the program Gephi, which allows links between nodes to be shown using different colours, thickness of lines, etc.

Light and Darkness

From the project description at University of Glasgow, School of Critical Studies:

Over the past 30 years, it has become clear that metaphor is not simply a literary phenomenon; metaphorical thinking underlies the way we make sense of the world conceptually. When we talk about ‘a healthy economy’ or ‘a clear argument’ we are using expressions that imply the mapping of one domain of experience (e.g. medicine, sight) onto another (e.g. finance, perception). When we describe an argument in terms of warfare or destruction (‘he demolished my case’), we may be saying something about the society we live in. The study of metaphor is therefore of vital interest to scholars in many fields, including linguists and psychologists, as well as to scholars of literature.

Key questions about metaphor remain to be answered; for example, how did metaphors arise? Which domains of experience are most prominent in metaphorical expressions? How have the metaphors available in English developed over the centuries in response to social changes? With the completion of the Historical Thesaurus, published as the Historical Thesaurus of the Oxford English Dictionary by OUP (Kay, Roberts, Samuels, Wotherspoon eds, 2009), we can begin to address these questions comprehensively and in detail for the first time. We now have the opportunity to track how metaphorical ways of thinking and expressing ourselves have changed over more than a millennium.

Almost half a century in the making, the Historical Thesaurus is the first source in the world to offer a comprehensive semantic classification of the words forming the written record of a language. In the case of English, this record covers thirteen centuries of change and development, in metaphor as in other areas. We will use the Historical Thesaurus evidence base to investigate how the language of one domain of experience (e.g. medicine) contributes to others (e.g. finance). As we proceed, we will be able to see innovations in metaphorical thinking at particular periods or in particular areas of experience, such as the Renaissance, the scientific revolution, and the early days of psychoanalysis.

To achieve our goals, we will devise tools for the analysis of metaphor historically, beginning with a systematic identification of instances where words extend their meanings from one domain into another. An annotated ‘Metaphor Map’, which will be freely available online, will allow us to demonstrate when and how significant shifts in meaning took place. On the basis of this evidence, the team will produce series of case studies and a book examining key domains of metaphorical meaning.

Conference papers from the project.

What a wickedly topic map-like idea!

Welcome to the Unified Astronomy Thesaurus!

Thursday, January 31st, 2013

Welcome to the Unified Astronomy Thesaurus!

From the webpage:

The Unified Astronomy Thesaurus (UAT) will be an open, interoperable and community-supported thesaurus which unifies the existing divergent and isolated Astronomy & Astrophysics thesauri into a single high-quality, freely-available open thesaurus formalizing astronomical concepts and their inter-relationships. The UAT builds upon the existing IAU Thesaurus with major contributions from the Astronomy portions of the thesauri developed by the Institute of Physics Publishing and the American Institute of Physics. We expect that the Unified Astronomy Thesaurus will be further enhanced and updated through a collaborative effort involving broad community participation.

While the AAS has assumed formal ownership of the UAT, the work will be available under a Creative Commons License, ensuring its widest use while protecting the intellectual property of the contributors. We envision that development and maintenance will be stewarded by a broad group of parties having a direct stake in it. This includes professional associations (IVOA, IAU), learned societies (AAS, RAS), publishers (IOP, AIP), librarians and other curators working for major astronomy institutes and data archives.

The main impetus behind the creation of a single thesaurus has been the wish to support semantic enrichment of the literature, but we expect that use of the UAT (along with other vocabularies and ontologies currently being developed in our community) will be much broader and will have a positive impact on the discovery of a wide range of astronomy resources, including data products and services.

Several thesauri are listed as resources at this site.

Certainly would make an interesting topic map project.

I first saw this at: Science Reference: A New Thesaurus Created for the Astronomy Community by Gary Price.

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus [December 18, 2012]

Wednesday, December 12th, 2012

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus

From the post:

EuroVoc 4.4 will be released on December 18, 2012. During this day, the website might be temporary unavailable.

6.883 thesaurus concepts

This new edition is the result of a thorough revision among other things according to the concepts introduced by the ‘Lisbon Treaty’. It includes 6.883 thesaurus concepts of which 85 concepts are new, 142 have been updated and 28 have been classified as obsolete concepts.

These new concepts are the results of the proposals sent by the librarians from the libraries of the national parliaments in Europe, the European Institutions namely the European Parliament and the users of EuroVoc. All the terms in Portuguese have been revised according to the Portuguese language spelling reform. The prior lexical value remains available as Non-Preferred Terms.

EuroVoc, the EU’s multilingual thesaurus

EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU, the European Parliament in particular. It contains terms in 22 EU languages. It is managed by the Publications Office, which moved forward to ontology-based thesaurus management and semantic web technologies conformant to W3C recommendations as well as latest trends in thesaurus standards.

There are documents prior to this version of the thesaurus and even documents prior to there being a EuroVoc thesaurus at all.

And there will be documents after EuroVoc has been superseded.

Not to mention in between there will be documents that use other vocabularies.

Good thing we have topic maps to use this resource to its best advantage.

A way station in a sea of semantic currents and drifts.

Thesauri (Vocabularies – TemaTres)

Saturday, August 18th, 2012

Thesauri (Vocabularies – TemaTres)

The TemaTres vocabulary server is important but even more so is its collection of one hundred and fifty vocabularies.

Send a note if you export your vocabulary to a topic map. Interested in examples of mappings between vocabularies.

TemaTres: the open source vocabulary server

Saturday, August 18th, 2012

TemaTres: the open source vocabulary server

From the webpage:

This is the International site for examples and cases on TemaTres, an open source vocabulary server for manage controlled vocabularies, taxonomies and thesaurus.

In this site you can found some resources about tools for knowledge management on digital spaces, TemaTres examples and some hosted vocabularies.

Quick link:

Said to export to:

Skos-Core, Zthes, TopicMap, Dublin Core, MADS, BS8723-5, RSS, SiteMap, txt

Looking at the documentation now.

Separate post coming on vocabularies at this site.

I first saw this at Beyond Search.

Iowa Government Gets a Digital Dictionary Provided By Access

Monday, April 9th, 2012

Iowa Government Gets a Digital Dictionary Provided By Access

Whitney Grace writes:

How did we get by without the invention of the quick search to look up information? We used to use dictionaries, encyclopedias, and a place called the library. Access Innovations, Inc. has brought the Iowa Legislature General Assembly into the twenty-first century.

The write-up “Access Innovations, Inc. Creates Taxonomy for Iowa Code, Administrative Code and Acts” tells us the data management industry leader has built a thesaurus that allows the Legislature to search its library of proposed laws, bills, acts, and regulations. Users can also add their unstructured data to the thesaurus. Access used their Data Harmony software to provide subscription-based delivery and they built the thesaurus on MAIstro.

Sounds very much like a topic map-like project, doesn’t it? Will be following up for more details.

ISO 25964-1 Thesauri for information retrieval

Friday, January 20th, 2012

Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval

Actually that is the homepage for Networked Knowledge Organization Systems/Services – N K O S but the lead announcement item is for ISO 25964-1, etc.

From that webpage:

New international thesaurus standard published

ISO 25964-1 is the new international standard for thesauri, replacing ISO 2788 and ISO 5964. The full title is Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. As well as covering monolingual and multilingual thesauri, it addresses 21st century needs for data sharing, networking and interoperability.

Content includes:

  • construction of mono- and multi-lingual thesauri;
  • clarification of the distinction between terms and concepts, and their inter-relationships;
  • guidance on facet analysis and layout;
  • guidance on the use of thesauri in computerized and networked systems;
  • best practice for the management and maintenance of thesaurus development;
  • guidelines for thesaurus management software;
  • a data model for monolingual and multilingual thesauri;
  • brief recommendations for exchange formats and protocols.

An XML schema for data exchange has been derived from the data model, and is available free of charge.

Coming next

ISO 25964-1 is the first of two publications. Part 2: Interoperability with other vocabularies is in the public review stage and will be available by the end of 2012.

Find out how you can obtain a copy from the news release.

Let me help you there, the correct number is: ISO 25964-1:2011 and the list price for a PDF copy is CHF 238,00, or in US currency (today), $257.66 (for 152 pages).

Shows what I know about semantic interoperability.

If you want semantic interoperability, you charge people $1.69 per page (152 pages) for access to the principles of thesauri to be used for information retrieval.
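The per-page figure is just the quoted USD price divided by the page count. A quick check, using only the two numbers given above: truncating to whole cents gives $1.69, while rounding to the nearest cent gives $1.70.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

price_usd = Decimal("257.66")  # USD price quoted in the post for the PDF
pages = 152                    # page count quoted in the post

per_page = price_usd / pages
cent = Decimal("0.01")

truncated = per_page.quantize(cent, rounding=ROUND_DOWN)
rounded = per_page.quantize(cent, rounding=ROUND_HALF_UP)
print(truncated, rounded)
```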

ISO/IEC and JTC 1 are all parts of a system of viable international (read non-vendor dominated) organizations for information/data standards. They are the natural homes for the management of data integration standards that transcend temporal, organizational, governmental and even national boundaries.

But those roles will not fall to them by default. They must seize the initiative and those roles. Clinging to old-style publishing models for support makes them appear timid in the face of current challenges.

Even vendors recognize their inability to create level playing fields for technology/information standards, and the benefits that come to vendors from de jure as well as non-de jure standards organizations.

ISO/IEC/JTC1, provided they take the initiative, can provide an international, de jure home for standards that form the basis for information retrieval and integration.

The first step to take is to make ISO/IEC/JTC1 information standards publicly available by default.

The second step is to call upon all members and beneficiaries, both direct and indirect, of ISO/IEC/JTC 1 work, to assist in the creation of mechanisms to support the vital roles played by ISO/IEC/JTC 1 as de jure standards bodies.

We can all learn something from ISO 25964-1 but how many of us will with that sticker price?

Graph Words: A Free Visual Thesaurus of the English Language

Monday, October 31st, 2011

Graph Words: A Free Visual Thesaurus of the English Language

From the post:

One of the very first examples of visualization that succeeds in merging beauty with function is Visual Thesaurus, a subscription-based online thesaurus and dictionary that shows the relationships between words through a beautiful interactive map.

The idea behind Graph Words is quite similar, though the service can be used completely free of charge.

Based on the knowledge captured in WordNet, a large lexical database of the English language, Graph Words is an interactive English dictionary and thesaurus that helps one find the meanings of words by revealing their connections among associated words. Any resulting network graph can also be stored as images.

I particularly liked “…helps one find the meanings of words by revealing their connections among associated words.”

I would argue that words only have meaning in the context of associated words. The unfortunate invention of the modern dictionary falsely portrays words as being independent of their context.

The standard Arabic dictionary, Lisan al-‘Arab (roughly, “The Arab Tongue”), was reported by my Arabic professor to be very difficult to use because the entries consisted of poetry and prose selections that illustrated the use of words in context. You have to be conversant to use the dictionary but that would be one of the reasons for using it, to become conversant. 😉 Both Lisan al-‘Arab (about 20,000 pages) and Lane’s Arabic-English Lexicon (about 8,000+ pages) are online now.

Geological Survey Austria launches thesaurus project

Tuesday, October 18th, 2011

Geological Survey Austria launches thesaurus project by Helmut Nagy.

From the post:

Throughout the last year the Semantic Web Company team has supported the Geological Survey of Austria (GBA) in setting up their thesaurus project. It started with a workshop in summer 2010 where we discussed use cases for using semantic web technologies as means to fulfill the INSPIRE directive. Now in fall 2011 GBA published their first thesauri as Linked Data using PoolParty’s new Linked Data front-end.

The Thesaurus Project of the GBA aims to create controlled vocabularies for the semantic harmonization of map-based geodata. The content-related realization of this project is governed by the Thesaurus Editorial Team, which consists of domain experts from the Geological Survey of Austria. With the development of semantically and technically interoperable geo-data the Geological Survey of Austria implements its legal obligation defined by the EU-Directive 2007/2/EC INSPIRE and the national “Geodateninfrastrukturgesetz” (GeoDIG), respectively.

I wonder if their “controlled vocabularies” are going to map to the terminology used over the history of Europe, in maps, art, accounts, histories, and other recorded materials?

If not, I wonder if there would be any support to tie that history into current efforts or do they plan on simply cutting off the historical record and starting with their new thesaurus?

Networked Knowledge Organization Systems/Services NKOS

Monday, October 17th, 2011

Networked Knowledge Organization Systems/Services NKOS

From the website:

NKOS is devoted to the discussion of the functional and data model for enabling knowledge organization systems/services (KOS), such as classification systems, thesauri, gazetteers, and ontologies, as networked interactive information services to support the description and retrieval of diverse information resources through the Internet.

Knowledge Organization Systems/Services (KOS) model the underlying semantic structure of a domain. Embodied as Web-based services, they can facilitate resource discovery and retrieval. They act as semantic road maps and make possible a common orientation by indexers and future users (whether human or machine). — Douglas Tudhope, Traugott Koch, New Applications of Knowledge Organization Systems

A wide variety of resources that will interest anyone working with knowledge systems. I would expect any number of these to appear in future posts with comments or observations.


Monday, October 17th, 2011

TaxoBank: Access, deposit, save, share, and discuss taxonomy resources

From the webpage:

Welcome to the TaxoBank Terminology Registry

The TaxoBank contains information about controlled vocabularies of all types and complexities. We invite you to both browse and contribute. Enjoy term lists for special purpose use, get ideas for building your own vocabulary, perhaps find one that can give you a quicker start.

The information collected about each vocabulary follows a study (TRSS) conducted by JISC, the Joint Information Systems Committee of the Higher and Further Education Funding Councils. All of the recommended fields included in the study’s final report are included; some of those the study identified as Optional are not. See more about the Terminology Registry Scoping Study (TRSS) at their site. In addition, input from other information experts was elicited in planning the site.

This is an interactive web site. To add information about a vocabulary, click on Create Content in the left navigation pane (you’ll need to register as a user first; we just need your name and email). There are only eight required fields, but your listing will be more useful if you complete all the applicable fields about your vocabulary.

Add a comment to almost any page – how you’ve used the vocabulary, what you’d add to it, how you’d use it if expanded to an ontology, etc. Comments are welcome on Event and Blog pages as well. Click on Add Comment, and enter your thoughts. Even anonymous visitors (not signed in) can add comments, but they’ll be reviewed by a site admin before they’re made visible.

You may also update the Events section of the site. Taxonomy, Knowledge Systems, Information Architecture or Management, Metadata are all appropriate event themes. Click on Create Content and then on Events to add a new one (you’ll need to be a registered user).

Contact us through the Contact page, with suggestions, corrections, or to discuss displaying your vocabulary on this site (particularly important if it was created on a college server and faces erasure at the end of the academic year), or if you have questions.

Thank you for visiting (and participating)!

The “Vocabulary spotlight” suggested “Thesaurus of BellyDancing” on my first visit.

To be honest, I had never thought about belly dancing having a thesaurus or even a standard vocabulary for its description.

For class: Browse the listing and pick out an entry for a subject area unfamiliar to you. Prepare a short (say, less than five minutes) oral review of the entry. What did you like or dislike, find useful or less than useful? Did anything about the entry interest you in finding out more about the subject matter or its treatment?

CENDI Agency Terminology Resources

Monday, October 17th, 2011

CENDI Agency Terminology Resources

From the webpage:

The following URLs provide access to the online thesauri and indexing resources of the various federal scientific & technical agencies including CENDI agencies. These resources are of interest to those wishing to know about the scientific and technical terminology used in various fields.

  • Agriculture & Food
  • Applied Science & Technologies
  • Astronomy & Space
  • Biology & Nature
  • Earth & Ocean Sciences
  • Energy & Energy Conservation
  • Environment & Environmental Quality
  • General Science
  • Health & Medicine
  • Physics, Chemistry, and Mathematics
  • Science Education

I will post on CENDI, but I thought this was important enough to call out separately, particularly since there are multiple thesauri in some of these categories.

For example:

NAL Agricultural Thesaurus

The NAL Agricultural Thesaurus (NALT) is annually updated and the 2007 edition contains over 65,800 terms organized into 17 subject categories. NALT is searchable online and is available in several formats (PDF, ASCII text, XML, SKOS) for download from the web site. NALT has standard hierarchical, equivalence and associative relationships and provides scope notes and over 2,400 definitions of terms for clarity. Proposals for new terminology can be sent to Published by the National Agricultural Library, United States Department of Agriculture.

Tesauro Agrícola

Tesauro Agrícola is the Spanish language translation of the NAL Agricultural Thesaurus (NALT). The thesaurus accommodates the complexity of the Spanish language from a Western Hemisphere perspective. First published in May 2007, the thesaurus contains over 15,700 translated concepts and contains definitions for more than 2,400 terms. The thesaurus is searchable with a Spanish interface and is available in several formats (PDF, ASCII text, XML) for download from the web site. Proposals for new terminology can be sent to . Published by the National Agricultural Library, United States Department of Agriculture.

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

Sunday, October 16th, 2011

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

From the webpage:

This is an international standard development project of ISO Technical Committee 46 (Information and documentation) Subcommittee 9 (Identification and description). The assigned Working Group (known as ISO TC46/SC9/WG8) is revising, merging, and extending two existing international standards: ISO 2788 and ISO 5964. The end product is a new standard—ISO 25964, Information and documentation – Thesauri and interoperability with other vocabularies—supporting the development and application of thesauri in today’s expanding context of networking opportunities. It is being published in two parts, as follows:

ISO 25964, Thesauri and interoperability with other vocabularies

  • Part 1: Thesauri for information retrieval
  • Part 2: Interoperability with other vocabularies

Part 1 was published in August, 2011 and Part 2 is due to appear by the end of 2011.

Unless you have $332 (US) burning a hole in your pocket, you probably want to visit: Format for Exchange of Thesaurus Data Conforming to ISO 25964-1, where the XML schema, documentation, etc., await your use.

I am very interested in how they handled interoperability in part 2.

CENDI Agency Indexing System Descriptions: A Baseline Report

Sunday, October 16th, 2011

CENDI Agency Indexing System Descriptions: A Baseline Report (1998)

In some ways a bit dated, but also a snapshot in time of the indexing practices of the:

  • National Technical Information Service (NTIS),
  • Department of Energy, Office of Scientific and Technical Information (DOE OSTI),
  • US Geological Survey/Biological Resources Division (USGS/BRD),
  • National Aeronautics and Space Administration, STI Program (NASA),
  • National Library of Medicine/National Institutes of Health (NLM),
  • National Air Intelligence Center (NAIC),
  • Defense Technical Information Center (DTIC).

The summary reads:

Software/technology identification for automatic support to indexing. As the resources for providing human indexing become more precious, agencies are looking for technology support. DTIC, NASA, and NAIC already have systems in place to supply candidate terms. New systems are under development and are being tested at NAIC and NLM. The aim of these systems is to decrease the burden of work borne by indexers.

Training and personnel issues related to combining cataloging and indexing functions. DTIC and NASA have combined the indexing and cataloging functions. This reduces the paper handling and the number of “stations” in the workflow. The need for a separate cataloging function decreases with the advent of EDMS systems and the scanning of documents with some automatic generation of cataloging information based on this scanning. However, the merger of these two diverse functions has been a challenge, particularly given the difference in skill level of the incumbents.

Thesaurus maintenance software. Thesaurus management software is key to the successful development and maintenance of controlled vocabularies. NASA has rewritten its system internally for a client/server environment. DTIC has replaced its systems with a commercial-off-the-shelf product. NTIS and USGS/BRD are interested in obtaining software that would support development of more structured vocabularies.

Linked or multi-domain thesauri. Both NTIS and USGS/BRD are interested in this approach. NTIS has been using separate thesauri for the main topics of the document. USGS/BRD is developing a controlled vocabulary to support metadata creation and searching but does not want to develop a vocabulary from scratch. In both cases, there is concern about the resources for development and maintenance of an agency-specific thesaurus. Being able to link to multiple thesauri that are maintained by their individual “owners” would reduce the investment and development time.

Full-text search engines and human indexing requirements. It is clear that the explosion of information on the web (both relevant web sites and web-published documents) cannot be indexed in the old way. There are not enough resources; yet, the chaos of the web begs for more subject organization. The view of current full-text search engines is that the users often miss relevant documents and retrieve a lot of “noise”. The future of web searching is unclear and demands or requirements that it might place on indexing is unknown.

Quality Control in a production environment. As resources decrease and timeliness becomes more important, there are fewer resources available for quality control of the records. The aim is to build the quality in at the beginning, when the documents are being indexed, rather than add review cycles. However, it is difficult to maintain quality in this environment.

Training time. The agencies face indexer turnover and the need to produce at ever-increasing rates. Training time has been shortened over the years. There is a need to determine how to make shorter training periods more effective.

Indexing systems designed for new environments, especially distributed indexing. An alternative to centralized indexers is a more distributed environment that can take advantage of cottage labor and contract employees. However, this puts increasing demands on the indexing system. It must be remotely accessible, yet secure. It must provide equivalent levels of validation and up-front quality control.

Major project: Update this report, focusing on the issues listed in the summary.


Tuesday, May 3rd, 2011


From the website:

PoolParty is a thesaurus management system and a SKOS editor for the Semantic Web including text mining and linked data capabilities. The system helps to build and maintain multilingual thesauri providing an easy-to-use interface. PoolParty server provides semantic services to integrate semantic search or recommender systems into enterprise systems like CMS, web shops, CRM or Wikis.

I encountered PoolParty in the video Pool Party – Semantic Search.

The video glosses over a lot of difficulties, but what effective advertising doesn’t?

Curious if anyone is familiar with this group/product?

Update: 31 May 2011

Slides: Pool Party – Semantic Search

Nice slide deck on semantic search issues.

Designing a thesaurus-based comparison search interface for linked cultural heritage sources

Sunday, October 3rd, 2010

Designing a thesaurus-based comparison search interface for linked cultural heritage sources

Authors: Alia Amin, Michiel Hildebrand, Jacco van Ossenbruggen, Lynda Hardman

Keywords: comparison search, thesauri, cultural heritage

Prototype: LISA,


Comparison search is an information seeking task where a user examines individual items or sets of items for similarities and differences. While this is a known information need among experts and knowledge workers, appropriate tools are not available. In this paper, we discuss comparison search in the cultural heritage domain, a domain characterized by large, rich and heterogeneous data sets, where different organizations deploy different schemata and terminologies to describe their artifacts. This diversity makes meaningful comparison difficult. We developed a thesaurus-based comparison search application called LISA, a tool that allows a user to search, select and compare sets of artifacts. Different visualizations allow users to use different comparison strategies to cope with the underlying heterogeneous data and the complexity of the search tasks. We conducted two user studies. A preliminary study identifies the problems experts face while performing comparison search tasks. A second user study examines the effectiveness of LISA in helping to solve comparison search tasks. The main contribution of this paper is to establish design guidelines for the data and interface of a comparison search application. Moreover, we offer insights into when thesauri and metadata are appropriate for use in such applications.

User-centric project that develops an interface into heterogeneous data sets.

What I would characterize as pre-mapping; that is, no “canonical” mapping has yet been established.

Perhaps a good idea to preserve a pre-mapping stage as any mapping represents but one choice among many.

Prescriptive vs. Adaptive Information Retrieval?

Friday, August 13th, 2010

Gary W. Strong and M. Carl Drott contend in A Thesaurus for End-User Indexing and Retrieval, Information Processing & Management, Vol. 22, No. 6, pp. 487-492, 1986, that:

A low-cost, practical information retrieval system, if it were to be designed, would require a thesaurus, but one in which end-users would be able to browse research topics by means of an organization that is concept-based rather than term-based as is the typical thesaurus.

…. (while elsewhere)

It is our hypothesis that, when the thesaurus can be envisioned by users as a simple, yet meaningful, organization of concepts, the entire information system is much more likely to be useable in an efficient manner by novice users. (emphasis added)

It puzzles me that experts are building a system of concepts for novices to use. Do you suspect experts have different views of the domains in question than novices? And approach their search for information with different assumptions?

Any concept system designed by an expert is a prescriptive information retrieval system. It represents their view of the domain and not that of a novice. Or rather it represents how the expert thinks a novice should navigate the field.

While the expert’s view may be useful for some purposes, such as socializing a novice into a particular view of the domain, it may be more useful for novices to use a novice’s view of the domain. To build that we would need to turn to novices in a domain. Perhaps through the use of adaptive information retrieval, IR that adapts to its user, rather than the other way around.

Adaptive information retrieval systems, I like that, ones that grow to be more like their users and less like their builders with every use.

ERIC – A Resource For Topic Maps Design and Research

Sunday, March 7th, 2010

ERIC – Education Resources Information Center offers free access to more than 1.3 million bibliographic records on education-related materials. Thousands of new records are added every month.

Education is communication and I can’t think of a better general category for topic maps than communication. There is no one-size-fits-all subject identification and no one way to communicate with all users. Clever use of resources found through ERIC may help avoid old mistakes so that we can make new ones.

The thesaurus feature of ERIC is very topic map-like. Entries are indexed under a set of uniform “descriptors” so you can locate records indexed by subject, regardless of the terminology the author may have used. To be completely accurate, I should say that topic maps are very thesaurus-like.

The ERIC data and thesaurus are freely available. Exploring the “triggers” that lead to assignment of “descriptors,” creating rules for merging “descriptors” with similar mechanisms in other data sets, or the advantages of associations would make good topic map research projects. Education is a current topic of public concern and dare I say funding?

Defusing A Combinatorial Explosion?

Thursday, March 4th, 2010

One of the oldest canards in the “map to a common identifier/model” game is the allegation of a lurking combinatorial explosion.

It goes something like this: If you have identifiers A, B, and C for a single subject, there are mappings to and from each identifier, hence:

  • A -> B
  • A -> C
  • B -> A
  • B -> C
  • C -> A
  • C -> B

Since no identifier maps to itself, the number of mappings is given by N * (N-1).
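To make the count concrete, here is a minimal Python sketch that enumerates the explicit mappings and checks them against N * (N-1). The function names `explicit_mappings` and `mapping_count` are made up for illustration; they do not come from any thesaurus or topic map tool.

```python
from itertools import permutations


def explicit_mappings(identifiers):
    """Return every directed mapping (source, target) between distinct identifiers."""
    # permutations(..., 2) yields ordered pairs and never pairs an item with itself.
    return list(permutations(identifiers, 2))


def mapping_count(n):
    """Closed form for the number of directed mappings among n identifiers."""
    return n * (n - 1)


if __name__ == "__main__":
    ids = ["A", "B", "C"]
    for source, target in explicit_mappings(ids):
        print(f"{source} -> {target}")

    # The enumeration and the closed form agree for three identifiers.
    assert len(explicit_mappings(ids)) == mapping_count(len(ids))

    # Growth of the explicit mapping count as identifiers are added.
    for n in (3, 10, 100):
        print(n, "identifiers:", mapping_count(n), "mappings")
```

For three identifiers this lists the same six mappings as above; for 100 identifiers the formula gives 9,900, which is the growth the argument appeals to.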

To avoid the overhead of tracking an ever-increasing number of mappings between identifiers, the “cure” is to map all identifiers for a subject to a single identifier.

If something doesn’t feel right, congratulations! You’re right, something isn’t right. As Sam Hunting observed when I forwarded a draft of this post to him, if the mapping argument were true, it would not be possible to construct a thesaurus.

But it is possible to construct a thesaurus. So how to reconcile both the observed mappings and the existence of thesauri? The trick is that a thesaurus has only implicit mappings between identifiers for a subject. That assumption is glossed over in the combinatorial explosion argument. If the mappings between the identifiers are left implied, the potential combinatorial explosion is defused. (You could also say the identifiers are collocated with each other, a term I will return to in other posts.)
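One way to picture implicit mappings is a rough Python sketch in which the identifiers for a subject simply sit together in one shared set, with no identifier singled out as canonical. The class name `CollocatedIdentifiers` and its methods are hypothetical, invented for this illustration, and are not how any particular thesaurus software works.

```python
class CollocatedIdentifiers:
    """Toy model: identifiers for one subject are stored together in a shared set."""

    def __init__(self):
        # identifier -> the shared set of identifiers it is collocated with
        self._group_of = {}

    def collocate(self, *identifiers):
        """Place the given identifiers (for one subject) in a single shared set."""
        group = set()
        for identifier in identifiers:
            # Merge in any set an identifier already belongs to.
            group |= self._group_of.get(identifier, {identifier})
        for identifier in group:
            self._group_of[identifier] = group
        return group

    def equivalents(self, identifier):
        """Derive the other identifiers on demand; no pairwise mapping is stored."""
        return self._group_of[identifier] - {identifier}

    def stored_entries(self):
        """Number of stored entries: one per identifier, not one per pair."""
        return len(self._group_of)


if __name__ == "__main__":
    ids = CollocatedIdentifiers()
    ids.collocate("A", "B", "C")
    print(sorted(ids.equivalents("A")))  # the identifiers collocated with A
    ids.collocate("C", "D")              # a fourth identifier joins the same subject
    print(sorted(ids.equivalents("D")))
    print(ids.stored_entries(), "stored entries for 4 identifiers")
```

With four identifiers the sketch stores four entries, where explicit pairwise mappings would need 4 * 3 = 12. Any "A -> B" mapping can still be read off when needed, which is the sense in which it is implied rather than recorded.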

Multiple identifiers for a subject lead to convenience and ease of use, not combinatorial explosions.