Archive for the ‘Vocabularies’ Category

Threat Assessment Glossary

Wednesday, April 24th, 2013

Threat Assessment Glossary by Denise Bulling and Mario Scalora.

If you are working in the public/national security area, you may need some vocabulary help.

I would check the definitions against other sources.

Here’s why:

Hunters (AKA Biters) Hunters are individuals who intend to follow a path toward violence and behave in ways to further that goal

I’m sure the NRA will like that one.

Identification Thoughts of the necessity and utility of violence by a subject that are made evident through behaviors such as researching previous attackers and collecting, practicing, and fantasizing about weapons

That looks like a typo but I can’t tell where it should go.

Terrorism Act of violence or threats of violence used to further the agenda of the perpetrator while causing fear and psychological distress

I would have included physical harm but I’m no expert on terrorism.

Construction of Controlled Vocabularies

Tuesday, April 2nd, 2013

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
  salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
  Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case, that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
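
One way to make that comparison repeatable is to write down, for each use of a word, the properties you relied on, and then compare those properties instead of the spelling. A minimal sketch in Python follows. The property names (`kind`, `measures`) and the helper names are my own assumptions for illustration; they do not come from the primer or from any topic map standard.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Usage:
    """One occurrence of a term plus the properties a reader used to identify its subject."""
    term: str
    properties: frozenset = field(default_factory=frozenset)


def usage(term, **props):
    # Store properties as a frozenset of (key, value) pairs so usages stay hashable.
    return Usage(term, frozenset(props.items()))


def same_subject(a, b, identifying_keys):
    """Treat two usages as the same subject when they agree on every identifying key.

    A key missing from either usage counts as unknown, so the answer is False.
    """
    pa, pb = dict(a.properties), dict(b.properties)
    return all(k in pa and k in pb and pa[k] == pb[k] for k in identifying_keys)


# Hypothetical property values, chosen only to mirror the primer's examples.
planet = usage("Mercury", kind="planet")
metal = usage("Mercury", kind="metal")
salinity = usage("salinity", kind="property", measures="salt content")
saltiness = usage("saltiness", kind="property", measures="salt content")

# Same spelling, different subjects.
print(same_subject(planet, metal, ["kind"]))
# Different spelling, same subject.
print(same_subject(salinity, saltiness, ["kind", "measures"]))
```

The interesting design question is who chooses `identifying_keys`. Different readers may pick different keys, which is exactly the information worth capturing.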

Mary Jane pointed out this resource in a recent comment.

Data Catalog Vocabulary (DCAT) [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

Data Catalog Vocabulary (DCAT)

Abstract:

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. This document defines the schema and provides examples for its use.

By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.

If you have comments, now would be a good time to finish them up for submission.
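
To give a feel for what a DCAT description looks like, here is a minimal sketch in Python that writes out a Turtle description of one dataset with one distribution. The example.org URIs, the title and the keywords are invented. The terms used (`dcat:Dataset`, `dcat:keyword`, `dcat:distribution`, `dcat:Distribution`, `dcat:downloadURL`, `dct:title`) are the ones I know from the published DCAT namespace, so check them against the draft under review before relying on them.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"


def turtle_literal(text):
    # Escape backslashes and double quotes so the value is a valid Turtle string.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def dataset_to_turtle(dataset_uri, title, keywords, download_url):
    """Return a small Turtle document describing one dataset with one distribution."""
    lines = [
        f"@prefix dcat: <{DCAT}> .",
        f"@prefix dct: <{DCT}> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f"    dct:title {turtle_literal(title)} ;",
    ]
    for kw in keywords:
        lines.append(f"    dcat:keyword {turtle_literal(kw)} ;")
    # The distribution URI is derived from the dataset URI purely for illustration.
    lines.append(f"    dcat:distribution <{dataset_uri}/csv> .")
    lines.append("")
    lines.append(f"<{dataset_uri}/csv> a dcat:Distribution ;")
    lines.append(f"    dcat:downloadURL <{download_url}> .")
    return "\n".join(lines)


print(dataset_to_turtle(
    "http://example.org/dataset/bus-stops",    # hypothetical dataset URI
    "Bus stops",
    ["transport", "bus"],
    "http://example.org/files/bus-stops.csv",  # hypothetical download URL
))
```

String templates keep the sketch dependency-free. For anything beyond a demonstration, an RDF library would be the safer way to serialize.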

I first saw this in a tweet by Sandro Hawke.

Core Public Service Vocabulary released for public review [Deadline 27 February 2013]

Thursday, February 14th, 2013

Core Public Service Vocabulary released for public review

From the post:

The Core Public Service Vocabulary has entered in public review period. Anyone interested is invited to provide feedback until 27 February 2013 (inclusive).

In December 2012, the ISA Programme launched the Core Public Service Vocabulary (CPSV) initiative as part of Action 1.1 on improving semantic interoperability in European e-Government systems. The CPSV is a simplified, reusable and extensible data model that captures the fundamental characteristics of a service offered by public administrations.

The CPSV is designed to make it easy to exchange basic information about the functions carried out by the public sector and the services in which those functions are carried out. By using the vocabulary, organisations publishing data about their services will for example enable:

  • easier discovery of those services within and across countries;
  • easier discovery of the legislation and policies that underpin service provision;
  • easier comparison of similar services provided by different organisations.

Download the draft specification and comment by 27 February 2013.

From text at the draft download site, it appears the Public Review Period was to run from 8 February to 27 February 2013.

Take a look and see if you think that is enough time, or whether you have other comments.

AGROVOC 2013 edition released

Monday, February 11th, 2013

AGROVOC 2013 edition released

From the post:

The AGROVOC Team is pleased to announce the release of the AGROVOC 2013 edition.

The updated version contains 32,188 concepts in up to 22 languages, resulting in a total of 626,211 terms (in 2012: 32,061 concepts, 625,096 terms).

Please explore AGROVOC by searching terms, or browsing hierarchies.

AGROVOC 2013 is available for download, and accessible via web services.

From the “about” page:

The AGROVOC thesaurus contains 32,188 concepts in up to 22 languages covering topics related to food, nutrition, agriculture, fisheries, forestry, environment and other related domains.

A global community of editors consisting of librarians, terminologists, information managers and software developers, maintain AGROVOC using VocBench, an open-source multilingual, web-based vocabulary editor and workflow management tool that allows simultaneous, distributed editing. AGROVOC is expressed in Simple Knowledge Organization System (SKOS) and published as Linked Data.

Need some seeds for your topic map in “…food, nutrition, agriculture, fisheries, forestry, environment and other related domains”?
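
Because the thesaurus is expressed in SKOS, pulling out concept labels as seed names takes very little code. The sketch below uses only the Python standard library and a made-up RDF/XML fragment: the concept URI and labels are illustrative, not real AGROVOC data. It also assumes concepts are serialized as `skos:Concept` elements, whereas a real dump may use `rdf:Description` with an `rdf:type` and would need a proper RDF parser.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

# Made-up sample in SKOS RDF/XML; not taken from AGROVOC.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/1">
    <skos:prefLabel xml:lang="en">maize</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">maïs</skos:prefLabel>
    <skos:altLabel xml:lang="en">corn</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def seed_topics(rdf_xml):
    """Map each concept URI to its names, grouped by language code."""
    topics = {}
    root = ET.fromstring(rdf_xml)
    for concept in root.findall("skos:Concept", NS):
        names = {}
        # Preferred labels first, then alternative labels, per language.
        for tag in ("skos:prefLabel", "skos:altLabel"):
            for label in concept.findall(tag, NS):
                names.setdefault(label.get(XML_LANG), []).append(label.text)
        topics[concept.get(RDF_ABOUT)] = names
    return topics


print(seed_topics(SAMPLE))
```

Each concept URI is a ready-made subject identifier, and the labels in each language are candidate names for the topic.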

EU – Law-Related Authority Files

Friday, January 11th, 2013

The EU Data Portal has a number of law-related authority files:

I first saw these at: New EU Data Portal links to several law-related authority files.

Machine Learning based Vocabulary Management Tool

Monday, January 7th, 2013

Machine Learning based Vocabulary Management Tool – Assessment for the Linked Open Data by Ahsan Morshed and Ritaban Dutta.

Abstract:

Reusing domain vocabularies in the context of developing the knowledge based Linked Open data system is the most important discipline on the web. Many editors are available for developing and managing the vocabularies or Ontologies. However, selecting the most relevant editor is very difficult since each vocabulary construction initiative requires its own budget, time, resources. In this paper a novel unsupervised machine learning based comparative assessment mechanism has been proposed for selecting the most relevant editor. Defined evaluation criterions were functionality, reusability, data storage, complexity, association, maintainability, resilience, reliability, robustness, learnability, availability, flexibility, and visibility. Principal component analysis (PCA) was applied on the feedback data set collected from a survey involving sixty users. Focus was to identify the least correlated features carrying the most independent information variance to optimize the tool selection process. An automatic evaluation method based on Bagging Decision Trees has been used to identify the most suitable editor. Three tools namely Vocbench, TopBraid EVN and Pool Party Thesaurus Manager have been evaluated. Decision tree based analysis recommended the Vocbench and the Pool Party Thesaurus Manager are the better performer than the TopBraid EVN tool with very similar recommendation scores.

With the caveat that sixty (60) users in your organization (the number surveyed in this study) might reach different results, this is a useful study of vocabulary software.

It is more useful for its evaluation criteria for vocabulary software than as any absolute guide to the appropriate software.

I first saw this at: New article on vocabulary management tools.

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus [December 18, 2012]

Wednesday, December 12th, 2012

Upcoming release of EuroVoc 4.4, EU’s multilingual thesaurus

From the post:

EuroVoc 4.4 will be released on December 18, 2012. During this day, the website might be temporary unavailable.

6.883 thesaurus concepts

This new edition is the result of a thorough revision among other things according to the concepts introduced by the ‘Lisbon Treaty’. It includes 6.883 thesaurus concepts of which 85 concepts are new, 142 have been updated and 28 have been classified as obsolete concepts.

These new concepts are the results of the proposals sent by the librarians from the libraries of the national parliaments in Europe, the European Institutions namely the European Parliament and the users of EuroVoc. All the terms in Portuguese have been revised according to the Portuguese language spelling reform. The prior lexical value remains available as Non-Preferred Terms.

EuroVoc, the EU’s multilingual thesaurus

EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU, the European Parliament in particular. It contains terms in 22 EU languages. It is managed by the Publications Office, which moved forward to ontology-based thesaurus management and semantic web technologies conformant to W3C recommendations as well as latest trends in thesaurus standards.

There are documents prior to this version of the thesaurus and even documents prior to there being a EuroVoc thesaurus at all.

And there will be documents after EuroVoc has been superseded.

Not to mention in between there will be documents that use other vocabularies.

Good thing we have topic maps to use this resource to its best advantage.

A way station in a sea of semantic currents and drifts.

Linked Science Core Vocabulary Specification

Monday, November 12th, 2012

Linked Science Core Vocabulary Specification (revision 0.91)

Abstract:

LSC, the Linked Science Core Vocabulary, is a lightweight vocabulary providing terms to enable publishers and researchers to relate things in science to time, space, and themes. More precisely, LSC is designed for describing scientific resources including elements of research, their context, and for interconnecting them. We introduce LSC as an example of building blocks for Linked Science to communicate the linkage between scientific resources in a machine-understandable way. The “core” in the name refers to the fact that LSC only defines the basic terms for science. We argue that the success of Linked Science—or Linked Data in general—lies in interconnected, yet distributed vocabularies that minimize ontological commitments. More specific terms needed by different scientific communities can therefore be introduced as extensions of LSC. LSC is hosted at LinkedScience.org; please check also other available vocabularies at LinkedScience.org/vocabularies.

A Linked Data vocabulary that you may encounter.

I first saw this in a tweet by Ivan Herman.

VEST Registry. Vocabularies

Wednesday, November 7th, 2012

VEST Registry. Vocabularies

From the webpage:

All the vocabularies in the VEST Registry are classified by type and subject domain. Most of the Vocabularies are related to indexing. The purpose of indexing is to facilitate the search and finding of the content in a collection by the use of controlled/code lists, authority files or controlled subject vocabularies. The indexing ensures that the content will be found by users when they search specifically in information management systems. There are different type of vocabularies like authority files, classification systems, concept maps, controlled lists, ontologies, taxonomies or subject headings. But under the concept Vocabularies you can find as well dictionaries, encyclopedies, glossaries, lexical databases or topic trees.

If I am reading the webpage correctly, there are 116 separate vocabularies.

Browse through them to get an idea of the range of materials here.

Just on the homepage I see:

African Studies Thesaurus

A structured vocabulary of 12,100 English terms in the field of African studies, the African Studies Thesaurus is developed and maintained by staff at the library of the African Studies Centre Leiden. It is used for indexing and retrieving material in the library collection and is directly linked to the catalogue.

ARABTERM United Nations Multilingual Terminology Database of the Arabic Translation Service

ARABTERM is a multilingual terminology database which provides United Nations nomenclature and special terms in four of the official UN languages – Arabic, English, French and Spanish. The database is mainly intended for use by the language and editorial staff of the United Nations to ensure consistent translation of common terms and phrases used within the Organization.

Biological Entities

This ontology manages reference data about biological species needed for fisheries fact sheets and statistical information, among other resources. Species items are organized and maintained in the Aquatic Science and Fisheries Information System (ASFIS) and currently includes nearly 11.000 species items related to Fisheries and Aquaculture.

CABI thesaurus

The CAB Thesaurus is the essential search tool for all users of the CAB ABSTRACTS™ and Global Health databases and related products. The CAB Thesaurus is not only an invaluable aid for database users but it has many potential uses by individuals and organizations indexing their own information resources for both internal use and on the Internet.

No slight intended towards any vocabulary I didn’t mention. Just a random listing from the homepage.

Why rebuild when you can re-use? (And map.)

FEMA Acronyms, Abbreviations and Terms

Wednesday, October 17th, 2012

FEMA Acronyms, Abbreviations and Terms

From the webpage:

The FAAT List is a handy reference for the myriad of acronyms and abbreviations used within the federal government, emergency management and first response communities. This year’s new edition, which continues to reflect the evolving U.S. Department of Homeland Security, contains an approximately 50 percent increase in the number of entries and definitions bringing the total to over 6,200 acronyms, abbreviations and terms. Some items listed are obsolete, but they are included because they may still appear in publications and correspondence. Obsolete items can be found at the end of this document.

This may be handy for reading FEMA or related government documents.

Hasn’t been updated since 2009.

If you know of a more recent resource, please give a shout.

Thesauri (Vocabularies – TemaTres)

Saturday, August 18th, 2012

Thesauri (Vocabularies – TemaTres)

The TemaTres vocabulary server is important but even more so is its collection of one hundred and fifty vocabularies.

Send a note if you export your vocabulary to a topic map. Interested in examples of mappings between vocabularies.

TemaTres: the open source vocabulary server

Saturday, August 18th, 2012

TemaTres: the open source vocabulary server

From the webpage:

This is the International site for examples and cases on TemaTres, an open source vocabulary server for manage controlled vocabularies, taxonomies and thesaurus.

In this site you can found some resources about tools for knowledge management on digital spaces, TemaTres examples and some hosted vocabularies.

Quick link:

Said to export to:

Skos-Core, Zthes, TopicMap, Dublin Core, MADS, BS8723-5, RSS, SiteMap, txt

Looking at the documentation now.

Separate post coming on vocabularies at this site.

I first saw this at Beyond Search.

The Statistical Core Vocabulary (scovo)

Monday, April 16th, 2012

The Statistical Core Vocabulary (scovo)

From the webpage:

This document specifies an [RDF-Schema] vocabulary for representing statistical data on the Web. It is normatively encoded in [XHTML+RDFa], that is embedded in this page.

The homepage reports this vocabulary as deprecated, but it is still cited as a namespace in the RDF Data Cube Vocabulary (1.6).

I don’t have any numbers on the actual use of this vocabulary but you probably need to be aware of it.

Data Documentation Initiative (DDI)

Monday, April 16th, 2012

Data Documentation Initiative (DDI)

From the website:

The Data Documentation Initiative (DDI) is an effort to create an international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification now supports the entire research data life cycle. DDI metadata accompanies and enables data conceptualization, collection, processing, distribution, discovery, analysis, repurposing, and archiving.

Two current development lines:

DDI-Lifecycle

Encompassing all of the DDI-Codebook specification and extending it, DDI-Lifecycle is designed to document and manage data across the entire life cycle, from conceptualization to data publication and analysis and beyond. Based on XML Schemas, DDI-Lifecycle is modular and extensible.

Users new to DDI are encouraged to use this DDI-Lifecycle development line as it incorporates added functionality. Use DDI-Lifecycle if you are interested in:

  • Metadata reuse across the data life cycle
  • Metadata-driven survey design
  • Question banks
  • Complex data, e.g., longitudinal data
  • Detailed geographic information
  • Multiple languages
  • Compliance with other metadata standards like ISO 11179
  • Process management and automation

The current version of the DDI-L Specification is Version 3.1.  DDI 3.1 was published in October 2009, superseding DDI 3.0 (published in April 2008). 

DDI-Codebook

DDI-Codebook is a more light-weight version of the standard, intended primarily to document simple survey data. Originally DTD-based, DDI-C is now available as an XML Schema.

The current version of DDI-C is 2.5.

Be aware that micro-data in DDI was mentioned in The RDF Data Cube Vocabulary draft as a possible target for “extension” of that proposal.

Suggestions of other domain specific data vocabularies?

Unlike the W3C, I don’t see the need for an embrace and extend strategy.

There are enough vocabularies, from ancient to present-day, to keep us all busy for the foreseeable future without trying to restart every current vocabulary effort.

European Commission launches consultation into e-interoperability

Thursday, February 23rd, 2012

European Commission launches consultation into e-interoperability by Derek du Preez.

From the post:

The European Commission (EC) has launched a one month public consultation into the problem of incompatible vocabularies used by developers of public administration IT systems.

“Core vocabularies” are used to make sharing and reusing data easier, and the EC hopes that if they are defined properly, it will be able to quickly and effectively launch e-Government cross-border services.

The EC has divided the consultation into three separate core vocabularies; person, business and location.

Despite the minimal nature of the core vocabularies, I think the expectations for their use are set by the final paragraph of this report:

Once the public consultation is over, the working groups will seek endorsement from EU Member States. This means that the vocabularies will not become a legal obligation, but will give them further exposure for wider use.

If you have pointers to current incompatible vocabularies, I would appreciate a ping, just so we can revisit those vocabularies in, say, five years to see the result of “exposure for wider use.”

Amalgame

Saturday, January 14th, 2012

Amalgame

Amalgame is the software that Tim Wray mentions in his post, vocabulary alignment, meaning and understanding in the world museum, as using a technique called “interactive alignment.”

From the homepage:

Amalgame (AMsterdam ALignment GenerAtion MEtatool) is a tool for finding, evaluating and managing vocabulary alignments. We explicitly do not aim to produce ‘yet another alignment method’ but rather seek to combine existing matching techniques and methods such as those developed within the context of the Ontology Alignment Evaluation Initiative (OAEI), in which different alignment methods can be combined using a workflow setup. The Amalgame Alignment server will feature:

  • A workflow composition functionality, where various alignment generators can be positioned. Their resulting mapping sets can be used as input for filtering methods, other alignment generators or combined into overlap sets.
  • A statistics function, where statistics for alignment sets will be shown
  • An evaluation tool, where subsets of alignments can be evaluated manually

Vocabulary and metadata workflow

The Amalgame toolkit realises the second step of a workflow specified by the Europeana Connect project for SKOSifying vocabularies and converting collection metadata into the EDM (Europeana Data Model). The first step, conversion of XML data into RDF, is supported by the XMLRDF tool.

Amalgame paper at TPDL 2011

We’re happy to announce our paper about Amalgam was accepted as a full paper for the International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011).

The extensive online appendix also contains a rich use case description.

I think you will want to grab the paper, which has the following abstract:

In many heritage institutes, objects are routinely described using terms from predefined vocabularies. When object collections need to be merged or linked, the question arises how those vocabularies relate. In practice it often unclear for data providers how well alignment tools will perform on their specific vocabularies. This creates a bottleneck to align vocabularies, as data providers want to have tight control over the quality of their data. We will discuss the key limitations of current tools in more detail and propose an alternative approach. We will show how this approach has been used in two alignment use cases, and demonstrate how it is currently supported by our Amalgame alignment platform.

I am downloading/installing the software.

I am curious if a similar approach, albeit without converting data into RDF, could be used to create alignments of unstructured vocabularies? Along with reasons for the mappings between vocabularies?

My reasoning, in part, is that there are far more unstructured vocabularies where access could be enhanced by mappings to other vocabularies, along with reasons for those mappings.
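
To make “mappings along with reasons” concrete, here is a minimal sketch of what such a record could look like, with no RDF involved. This is not how Amalgame stores alignments. The `Mapping` fields, the helper name and the sample records are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str   # term in the first vocabulary
    target: str   # term in the second vocabulary
    reason: str   # why the two terms are taken to name the same subject


def mappings_for(term, mappings, accepted_reasons=None):
    """Return mappings from `term`, optionally keeping only reasons the caller trusts."""
    return [
        m for m in mappings
        if m.source == term and (accepted_reasons is None or m.reason in accepted_reasons)
    ]


# Invented records for illustration only.
records = [
    Mapping("VHF", "Very High Frequency", "abbreviation expansion"),
    Mapping("salinity", "saltiness", "editor judged synonyms"),
    Mapping("Mercury", "Mercury (planet)", "document is about astronomy"),
]

for m in mappings_for("Mercury", records):
    print(f"{m.source} -> {m.target} because {m.reason}")
```

Keeping the reason next to the mapping lets a later user accept an “abbreviation expansion” without question while second-guessing an editor’s judgment call.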

Vocabulary alignment, meaning and understanding in the world museum

Saturday, January 14th, 2012

vocabulary alignment, meaning and understanding in the world museum by Tim Wray.

From the post:

We live in a world of silos. Silos data. Silos of culture. Linked Open Data aims to tear down these silos and create unity among the collections, their data and their meaning. The World Museum awaits us.

It comes to no surprise that I begin this post with such Romantic allusions. Our discussions of vocabularies – as technical behemoths and cultural artefacts – were lively and florid at a recent gathering of researchers library and museum professionals at LODLAM-NZ. Metaphors of time and tide – depicted beautifully in this companion post by Ingrid Mason, highlight issues of their expressive power of their meaning over time and across cultures. I present a very broad technical perspective on the matter beginning with a metaphor for what I believe represents the current state of digital cultural heritage : a world of silos.

Among these silos lie vocabularies that describe their collections and induce meaning to their objects. Originally employed to assist cataloguers and disambiguate terms, vocabularies have grown to encompass rich semantic information, often pertaining to the needs of that institution, their collection or their creator communities. Vocabularies themselves are cultural artefacts representing a snapshot of sense making. Like the objects that they describe, vocabularies can depict a range of substance from Cold War paranoia to escapist and consumerist Disneyfication. Inherent within them are the world views, biases, and focal points of their creators. An object’s source vocabulary should always be recorded as a significant part of it’s provenance. Welcome to the recursive hell of meta-meta-data.

Within the context of the museum, vocabularies form the backbone from which collection descriptions are tagged, catalogued or categorised. But there are many vocabularies, and the World Museum needs a universal language. LODLAM-NZ embraced the enthusiasm of a universal language but also understood the immense technical challenges that follow vocabulary alignment and, in many cases, natural language processing in general. However, if done successfully, alignment does a few great things: it normalises the labels that we assign to objects so that a unity of inferencing, reasoning and understanding can occur across vast swathes of collections; it can provide semantic context to those labels for even deeper, more compelling relations among the objects and it can be used to disambiguate otherwise flat or “unsemantified” meta-data, such as small free-text fields and social tags.

Vocabulary alignment is the process of putting two vocabularies side-by-side, finding the best matches, and joining the dots.

Tim’s message is not one of despair.

In fact, he describes how researchers have brought humans back into the picture, seeking to take advantage of what machines do best (simple matches) and what humans do better (more complex matching).
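
That division of labor can be sketched in a few lines of Python. It is my own illustration, not the method from the paper or software Tim cites. Exact matches after trivial normalization are accepted automatically, and everything else goes into a review queue with a few fuzzy candidates for a human to judge. The sample terms and the 0.8 cutoff are arbitrary assumptions.

```python
import difflib


def normalize(term):
    # Lowercase and collapse whitespace; deliberately simple.
    return " ".join(term.lower().split())


def align(source_terms, target_terms, cutoff=0.8):
    """Split source terms into exact matches and a queue of candidates for human review."""
    targets = {normalize(t): t for t in target_terms}
    matched, review = {}, {}
    for term in source_terms:
        key = normalize(term)
        if key in targets:
            # The machine handles the simple case.
            matched[term] = targets[key]
        else:
            # Offer up to three similar-looking targets; a person decides.
            close = difflib.get_close_matches(key, list(targets), n=3, cutoff=cutoff)
            review[term] = [targets[c] for c in close]
    return matched, review


matched, review = align(
    ["Salinity", "VHF", "Colour"],
    ["salinity", "Very High Frequency", "color"],
)
print("accepted automatically:", matched)
print("needs a human:", review)
```

Note that “VHF” ends up in the review queue with no candidates at all. String similarity cannot connect an abbreviation to its expansion, which is where the human earns their keep.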

He references a paper and software, which I will be posting about separately, that allow humans to refine mappings.

My only caution is that even human refinements are time and culture bound. That is, a refinement that is useful today is a time and cultural artifact (in the archeological sense) that may need replacement today for use by visitors from another culture, and certainly will when the users are in a later time period.

That is, we need to build systems that manage (record/track?) changes in semantic meaning rather than attempting to create semantic edifices designed to hold back the tides of semantic change.

Closing the Knowledge Gap:.. (Lessons for TMs?)

Friday, December 30th, 2011

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

From the post:

Cisco Systems set out to build a system that takes the search for knowledge beyond documents into the content of social network inside the enterprise. The resulting Cisco Pulse platform was built to deliver corporate employees a better understanding who’s communicating with whom, how, and about what. Working with Lucid Imagination, Cisco turned to open source — specifically, Solr/Lucene technology — as the foundation of the search architecture.

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

You need to read the entire article but two things to note:

  • No uniform vocabulary: Every “organization” created its own.
  • Automatic tagging: Content was automatically tagged (read: users did not tag)
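
Those two points are easy to sketch, with the loud caveat that the article does not describe how Cisco Pulse actually tags content. The following is only a naive illustration of per-organization vocabularies applied automatically, using whole-word, case-insensitive matching. The vocabularies and the message are invented.

```python
import re


def build_tagger(vocabulary):
    """Compile one case-insensitive, whole-word pattern per vocabulary term."""
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in vocabulary
    }

    def tag(text):
        # Return the vocabulary terms found in the text, sorted for stable output.
        return sorted(term for term, pat in patterns.items() if pat.search(text))

    return tag


# Hypothetical vocabularies for two organizations; each defines its own terms.
tag_engineering = build_tagger(["Solr", "search architecture", "Lucene"])
tag_marketing = build_tagger(["case study", "knowledge sharing"])

message = "Draft case study on the Solr-based search architecture is ready."
print(tag_engineering(message))
print(tag_marketing(message))
```

The same message gets different tags depending on whose vocabulary is applied, and no user had to tag anything.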

The article doesn’t go into any real depth about the tagging but it is implied that who created the content and other information is getting “tagged” as well.

In a topic maps context, I read that to mean that once a vocabulary is declared and tagging is automatic, another process could create associations with roles and role players, and other topic map constructs, without bothering end users about those tasks.

Not to mention that declaring equivalents between tags as part of the reading/discovery process might be limited to some but not all users.

An incremental or perhaps even evolving authoring of a topic map.

Rather than a dead-tree resource, delivered as a fait accompli, a topic map can change as new information or new views of existing/new information are added to the map. (A topic map doesn’t have to be so useful. It can be the equivalent of a dead-tree resource if you really want.)

DQM-Vocabulary

Saturday, October 22nd, 2011

DQM-Vocabulary announced by Christian Fürber:

The DQM-Vocabulary supports data quality management activities in Semantic Web architectures. It’s major strength is the ability to represent data requirements, i.e. prescribed (individual) directives or consensual agreements that define the content and/or structure that constitute high quality data instances and values, so that computers can interpret the requirements and take further actions. Among other things, the DQM-Vocabulary supports the following tasks:

  • Automated creation of data quality monitoring and assessment reports based on previously specified data requirements
  • Exchange of data quality information and data requirements on web-scale
  • Automated consistency checks between data requirements

The DQM-Vocabulary is available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license at http://purl.org/dqm-vocabulary/v1/dqm

A primer with examples on how to use the DQM-Vocabulary can be found at http://purl.org/dqm-vocabulary

A mailing list for issues and questions around the DQM-Vocabulary can be found at http://groups.google.com/group/dqm-vocabulary

Interesting work but I think pretty obviously only commercial interests are going to have an incentive to put in the time and effort to use such a system.

Which reminds me, how is this different from the OASIS Universal Business Language (UBL) activity? UBL has already been adopted in a number of countries, particularly for government contracts. They have specified the semantics that businesses need to automate some contractual matters.

I suppose more broadly, where is the commercial demand for the DQM-Vocabulary?

Are there identifiable activities that lack data quality management now, for which DQM will be a solution? If so, which ones?

If other data quality management solutions are in place, what advantages over current systems are offered by DQM? Are those sufficient to justify changing present systems?

Networked Knowledge Organization Systems/Services NKOS

Monday, October 17th, 2011

Networked Knowledge Organization Systems/Services NKOS

From the website:

NKOS is devoted to the discussion of the functional and data model for enabling knowledge organization systems/services (KOS), such as classification systems, thesauri, gazetteers, and ontologies, as networked interactive information services to support the description and retrieval of diverse information resources through the Internet.

Knowledge Organization Systems/Services (KOS) model the underlying semantic structure of a domain. Embodied as Web-based services, they can facilitate resource discovery and retrieval. They act as semantic road maps and make possible a common orientation by indexers and future users (whether human or machine). — Douglas Tudhope, Traugott Koch, New Applications of Knowledge Organization Systems

A wide variety of resources that will interest anyone working with knowledge systems. I would expect any number of these to appear in future posts with comments or observations.

CENDI Science Terminology Locator

Monday, October 17th, 2011

CENDI Science Terminology Locator

Another CENDI resource that merits special mention.

From the webpage:

Browse the terminology resources across the U.S. Federal Science Agencies by selecting a topic and clicking the acronym resource link next to the category.

What you get when following one of the terminology links varies: “page not found” for NASA; RDF as an option at NALT; very complex term navigation (DOE); what appear to be search results in an agency database (USGS); a listing of terms with definitions and some navigation (DTIC); Descriptor Data (MeSH); “page not found” for NBII; and an outdated link for ERIC that nonetheless redirects to a thesaurus navigation page.

If you have someone in government who doesn’t think varying terminologies are an issue, send them this link. The varying responses and what you see when you get there should be proof enough for anyone.

TaxoBank

Monday, October 17th, 2011

TaxoBank: Access, deposit, save, share, and discuss taxonomy resources

From the webpage:

Welcome to the TaxoBank Terminology Registry

The TaxoBank contains information about controlled vocabularies of all types and complexities. We invite you to both browse and contribute. Enjoy term lists for special purpose use, get ideas for building your own vocabulary, perhaps find one that can give you a quicker start.

The information collected about each vocabulary follows a study (TRSS) conducted by JISC, the Joint Information Systems Committee of the Higher and Further Education Funding Councils. All of the recommended fields included in the study’s final report are included; some of those the study identified as Optional are not. See more about the Terminology Registry Scoping Study (TRSS) at their site. In addition, input from other information experts was elicited in planning the site.

This is an interactive web site. To add information about a vocabulary, click on Create Content in the left navigation pane (you’ll need to register as a user first; we just need your name and email). There are only eight required fields, but your listing will be more useful if you complete all the applicable fields about your vocabulary.

Add a comment to almost any page – how you’ve used the vocabulary, what you’d add to it, how you’d use it if expanded to an ontology, etc. Comments are welcome on Event and Blog pages as well. Click on Add Comment, and enter your thoughts. Even anonymous visitors (not signed in) can add comments, but they’ll be reviewed by a site admin before they’re made visible.

You may also update the Events section of the site. Taxonomy, Knowledge Systems, Information Architecture or Management, Metadata are all appropriate event themes. Click on Create Content and then on Events to add a new one (you’ll need to be a registered user).

Contact us through the Contact page, with suggestions, corrections, or to discuss displaying your vocabulary on this site (particularly important if it was created on a college server and faces erasure at the end of the academic year), or if you have questions.

Thank you for visiting (and participating)!

The “Vocabulary spotlight” suggested “Thesaurus of BellyDancing” on my first visit.

To be honest, I had never thought about belly dancing having a thesaurus or even a standard vocabulary for its description.

For class: Browse the listing and pick out an entry for a subject area unfamiliar to you. Prepare a short (say, less than 5 minutes) oral review of the entry. What did you like or dislike, find useful or less than useful, etc.? Did anything about the entry interest you in finding out more about the subject matter or its treatment?

CENDI Agency Terminology Resources

Monday, October 17th, 2011

CENDI Agency Terminology Resources

From the webpage:

The following URLs provide access to the online thesauri and indexing resources of the various federal scientific & technical agencies including CENDI agencies. These resources are of interest to those wishing to know about the scientific and technical terminology used in various fields.

  • Agriculture & Food
  • Applied Science & Technologies
  • Astronomy & Space
  • Biology & Nature
  • Earth & Ocean Sciences
  • Energy & Energy Conservation
  • Environment & Environmental Quality
  • General Science
  • Health & Medicine
  • Physics, Chemistry, and Mathematics
  • Science Education

I will post on CENDI, but I thought this was important enough to call out separately, particularly since there are multiple thesauri in some of these categories.

For example:

NAL Agricultural Thesaurus http://agclass.nal.usda.gov/agt/agt.shtml

The NAL Agricultural Thesaurus (NALT) is annually updated and the 2007 edition contains over 65,800 terms organized into 17 subject categories. NALT is searchable online and is available in several formats (PDF, ASCII text, XML, SKOS) for download from the web site. NALT has standard hierarchical, equivalence and associative relationships and provides scope notes and over 2,400 definitions of terms for clarity. Proposals for new terminology can be sent to thes@nal.usda.gov. Published by the National Agricultural Library, United States Department of Agriculture.

Tesauro Agrícola http://agclass.nal.usda.gov/agt_es.shtml

Tesauro Agrícola is the Spanish language translation of the NAL Agricultural Thesaurus (NALT). The thesaurus accommodates the complexity of the Spanish language from a Western Hemisphere perspective. First published in May 2007, the thesaurus contains over 15,700 translated concepts and contains definitions for more than 2,400 terms. The thesaurus is searchable with a Spanish interface and is available in several formats (PDF, ASCII text, XML) for download from the web site. Proposals for new terminology can be sent to thes@nal.usda.gov . Published by the National Agricultural Library, United States Department of Agriculture.
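The NALT description above mentions “standard hierarchical, equivalence and associative relationships.” For anyone who has not worked with a thesaurus before, here is a minimal sketch of what those three relationship types look like as a data structure. The class names, method names, and the sample terms are my own illustration (the hierarchy in particular is invented), not anything taken from NALT or its download formats.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term with the three classic relationship types."""
    label: str
    broader: set = field(default_factory=set)   # hierarchical (BT)
    narrower: set = field(default_factory=set)  # hierarchical (NT)
    related: set = field(default_factory=set)   # associative (RT)
    use_for: set = field(default_factory=set)   # equivalence (UF)


class Thesaurus:
    """Hypothetical toy thesaurus; not the NALT data model."""

    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # lowercased non-preferred label -> preferred label

    def add(self, label):
        # Create the term on first mention, otherwise return the existing one.
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader, narrower):
        self.add(broader).narrower.add(narrower)
        self.add(narrower).broader.add(broader)

    def add_equivalent(self, preferred, variant):
        self.add(preferred).use_for.add(variant)
        self.use[variant.lower()] = preferred

    def add_related(self, a, b):
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def resolve(self, label):
        """Map a known label to its preferred term, or return None."""
        if label in self.terms:
            return label
        return self.use.get(label.lower())


thesaurus = Thesaurus()
thesaurus.add_equivalent("salinity", "saltiness")         # equivalence
thesaurus.add_hierarchy("soil properties", "salinity")    # invented hierarchy
thesaurus.add_related("salinity", "irrigation")           # invented association
```

Calling `thesaurus.resolve("saltiness")` returns the preferred term `salinity`, which is the whole point of vocabulary control: many entry words, one descriptor.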

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

Sunday, October 16th, 2011

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

From the webpage:

This is an international standard development project of ISO Technical Committee 46 (Information and documentation) Subcommittee 9 (Identification and description). The assigned Working Group (known as ISO TC46/SC9/WG8) is revising, merging, and extending two existing international standards: ISO 2788 and ISO 5964. The end product is a new standard—ISO 25964, Information and documentation – Thesauri and interoperability with other vocabularies—supporting the development and application of thesauri in today’s expanding context of networking opportunities. It is being published in two parts, as follows:

ISO 25964, Thesauri and interoperability with other vocabularies

  • Part 1: Thesauri for information retrieval
  • Part 2: Interoperability with other vocabularies

Part 1 was published in August, 2011 and Part 2 is due to appear by the end of 2011.

Unless you have $332 (US) burning a hole in your pocket, you probably want to visit: Format for Exchange of Thesaurus Data Conforming to ISO 25964-1, where the XML schema, documentation, etc., await your use.

I am very interested in how they handled interoperability in Part 2.

Data Mining Research Notes – Wiki

Saturday, October 8th, 2011

Data Mining Research Notes – Wiki

You can go to the parent resource but I am deliberately pointing to the “wiki” resource page.

It is a collection of terms from data mining with pointers to Wikipedia pages for each one.

While I may quibble with the readability of some of the work at Wikipedia, I must confess to having created no competing explanations for their consideration.

Perhaps that is something that I could use to fill the idle hours. ;-) Seriously, readable explanations of technical material are both an art form and quite welcome to most technical types. If nothing else, they save them the time of explaining, and they may help others become interested.

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation

Sunday, September 25th, 2011

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation by Odile Piton (SAMM), Slim Mesfar (RIADI), and Hélène Pignot (SAMM).

Abstract:

Since 2006 we have undertaken to describe the differences between 17th century English and contemporary English thanks to NLP software. Studying a corpus spanning the whole century (tales of English travellers in the Ottoman Empire in the 17th century, Mary Astell’s essay A Serious Proposal to the Ladies and other literary texts) has enabled us to highlight various lexical, morphological or grammatical singularities. Thanks to the NooJ linguistic platform, we created dictionaries indexing the lexical variants and their transcription in CE. The latter is often the result of the validation of forms recognized dynamically by morphological graphs. We also built syntactical graphs aimed at transcribing certain archaic forms in contemporary English. Our previous research implied a succession of elementary steps alternating textual analysis and result validation. We managed to provide examples of transcriptions, but we have not created a global tool for automatic transcription. Therefore we need to focus on the results we have obtained so far, study the conditions for creating such a tool, and analyze possible difficulties. In this paper, we will be discussing the technical and linguistic aspects we have not yet covered in our previous work. We are using the results of previous research and proposing a transcription method for words or sequences identified as archaic.

Everyone working on search engines needs to print a copy of this article and read it at least once a month.

Seriously, the senses of both words and grammar evolve over centuries and even more quickly. What seem like correct search results from as recently as the 1950s may be quite incorrect.

For example (I don’t have the episode reference, perhaps someone can supply it) there was an “I Love Lucy” episode where Lucy says on the phone to Ricky that some visitor (at home) is “making love to her,” which meant nothing more than sweet talk. Not sexual intercourse.

I leave it for your imagination how large the semantic gap may be between English texts and originals composed in another language, culture, and historical context between 2,000 and 6,000 years ago. Flattening the complexities of ancient texts to bumper sticker snippets does a disservice to them and to ourselves.

SAGA: A DSL for Story Management

Monday, September 12th, 2011

SAGA: A DSL for Story Management by Lucas Beyak and Jacques Carette (McMaster University).

Abstract:

Video game development is currently a very labour-intensive endeavour. Furthermore it involves multi-disciplinary teams of artistic content creators and programmers, whose typical working patterns are not easily meshed. SAGA is our first effort at augmenting the productivity of such teams.

Already convinced of the benefits of DSLs, we set out to analyze the domains present in games in order to find out which would be most amenable to the DSL approach. Based on previous work, we thus sought those sub-parts that already had a partially established vocabulary and at the same time could be well modeled using classical computer science structures. We settled on the ‘story’ aspect of video games as the best candidate domain, which can be modeled using state transition systems.

As we are working with a specific company as the ultimate customer for this work, an additional requirement was that our DSL should produce code that can be used within a pre-existing framework. We developed a full system (SAGA) comprised of a parser for a human-friendly language for ‘story events’, an internal representation of design patterns for implementing object-oriented state-transitions systems, an instantiator for these patterns for a specific ‘story’, and three renderers (for C++, C# and Java) for the instantiated abstract code.
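To make “state transition systems” concrete for the ‘story’ domain, here is a minimal sketch of my own. It is not SAGA’s language, its internal representation, or its generated code; the class name, the method names, and the little quest story are all invented for illustration.

```python
class StoryMachine:
    """A minimal state transition system: states, events, transitions."""

    def __init__(self, start):
        self.state = start
        self.transitions = {}  # (state, event) -> next state

    def add_transition(self, source, event, target):
        self.transitions[(source, event)] = target

    def fire(self, event):
        """Advance the story if the event is allowed in the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            return False  # event is ignored in this state
        self.state = self.transitions[key]
        return True


# A hypothetical three-step story.
story = StoryMachine(start="village")
story.add_transition("village", "accept_quest", "forest")
story.add_transition("forest", "find_key", "castle_gate")
story.add_transition("castle_gate", "open_gate", "castle")

story.fire("find_key")      # ignored: the player is still in the village
story.fire("accept_quest")  # the story moves to the forest
```

The table of `(state, event)` pairs is the part a story writer cares about; everything else is plumbing that, as I read the abstract, a DSL would be expected to generate.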

I mention this only in part because of Jack Park’s long standing interest in narrative structures.

The other reason I mention this article is it is a model for how to transition between vocabularies in a useful way.

Transitioning between vocabularies is nearly as constant a theme in computer science as data storage. Not to mention that disciplines, domains, professions, etc., have been transitioning between vocabularies for thousands of years. Some more slowly than others; some terms in legal vocabularies date back centuries.

We need vocabularies and data structures, but with the realization that none of them are final. If you want blind interchange of topic maps, I would strongly suggest that you use one of the standard syntaxes.

But with the realization that you will encounter data that isn’t in a standard topic map syntax. What subjects are represented there? How would you tell others about them? And those vocabularies are going to change over time, just as there were vocabularies before RDF and topic maps.

If you ask an honest MDM advocate, they will tell you that the current MDM effort is not really all that different from MDM in the ’90s. And MDM may be what you need, depending on your requirements. (Sorry, master data management = MDM.)

The point being that there isn’t any place where a particular vocabulary or “solution” is going to freeze the creativity of users and even programmers, to say nothing of the rest of humanity. Change is the only constant and those who aren’t prepared to deal with it will be the worse off for it.

iQvoc 3.0 released

Monday, May 9th, 2011

iQvoc 3.0 released

An SKOS tool that is described on its “about” page as:

iQvoc is a web-based open source tool for managing vocabularies (classifications, thesauri, etc.). It combines an intuitive user interface with Semantic Web standards.

The navigation is intuitive, providing direct links and hierarchical tree visualizations. All common browsers are supported. Due to iQvoc’s modular architecture, its appearance can be easily and extensively customized.

iQvoc covers a comprehensive range of capabilities:

  • support for multiple languages in both the user interface and the content corpus (i.e. labels, notes etc.)
  • import/export of existing SKOS vocabularies
  • editorial control and workflow
  • notes and annotations
  • use of the vocabulary within the Linked Data network
  • modularity and extensibility