Archive for the ‘Subject Headings’ Category

Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Thursday, July 2nd, 2015

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse…

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.

Kirk goes on to say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model the elements/terms of common data models, taxonomies and ontologies? That would enable users to search across such elements/terms by the properties of those topics, possibly discovering topics that represent the same subject under a different term or element.

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms. A wealth of information, including previous elements/terms for the same subject, would be present at each topic.
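A minimal sketch of what that could look like, with entirely hypothetical subject identifiers and vocabulary names: terms for the same subject merge into one topic, so a search on any term, old or new, finds the topic and every other name the subject has carried.

```python
# Hypothetical sketch: vocabulary terms modeled as topics that merge on
# a shared subject identifier. All identifiers, vocabulary labels and
# terms below are illustrative, not drawn from any actual taxonomy.

from dataclasses import dataclass, field

@dataclass
class Topic:
    subject_id: str                            # shared identifier for the subject
    names: dict = field(default_factory=dict)  # vocabulary -> term

class TopicMap:
    def __init__(self):
        self.topics = {}

    def add_term(self, subject_id, vocabulary, term):
        # Terms sharing a subject identifier merge into one topic.
        topic = self.topics.setdefault(subject_id, Topic(subject_id))
        topic.names[vocabulary] = term

    def find(self, term):
        # Search across every vocabulary's terms, past and present.
        return [t for t in self.topics.values() if term in t.names.values()]

tm = TopicMap()
tm.add_term("drag-racing-car", "older-vocabulary", "Dragsters")
tm.add_term("drag-racing-car", "newer-vocabulary", "Drag cars")

# A user who only knows the older term still finds the topic,
# along with every other name recorded for the subject.
[topic] = tm.find("Dragsters")
print(sorted(topic.names.values()))  # ['Drag cars', 'Dragsters']
```

Updating a vocabulary then means adding a new name to an existing topic rather than replacing a bare term, so prior usage remains searchable.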

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

SACO: Subject Authority Cooperative Program of the PCC

Friday, February 10th, 2012

SACO: Subject Authority Cooperative Program of the PCC

SACO was established to allow libraries to contribute proposed subject headings to the Library of Congress.

Of particular interest is: Web Resources for SACO Proposals by Adam L. Schiff.

It is a very rich source of reference materials that you may find useful in developing subject heading proposals or subject classifications for other uses (such as topic maps).

But don’t neglect the materials you find on the SACO homepage.

Dragsters, Drag Cars & Drag Racing Cars

Friday, February 10th, 2012

I still remember the cover of Hot Rod magazine that announced (from memory) “The 6’s are here!” Don “The Snake” Prudhomme had broken the 200 mph barrier in a drag race. Other memories follow on from that one but I mention it to explain my interest in a recent Subject Authority Cooperative Program decision not to have a cross-reference from dragster (the term I would have used) to the more recent terms, drag cars or drag racing cars.

The expected search (in this order) due to this decision is:

Cars (Automobiles) -> redirect to Automobiles -> Automobiles -> narrower term -> Automobiles, racing -> narrower term -> Dragsters
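The chain above can be sketched as a walk through see references and narrower-term (NT) links. Everything here is a hypothetical fragment of a vocabulary, not actual authority data:

```python
# Hypothetical sketch of the hierarchy walk above: a user must follow a
# see reference and then narrower-term (NT) links, step by step, to
# reach Dragsters from a search that begins with Cars.

from collections import deque

use_refs = {"Cars (Automobiles)": "Automobiles"}  # see/USE references
narrower = {
    "Automobiles": ["Automobiles, racing"],
    "Automobiles, racing": ["Dragsters"],
}

def path_to(start, goal):
    """Breadth-first walk through see references and NT links."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        term = path[-1]
        if term == goal:
            return path
        if term in use_refs:
            queue.append(path + [use_refs[term]])
        for nt in narrower.get(term, []):
            queue.append(path + [nt])
    return None

print(path_to("Cars (Automobiles)", "Dragsters"))
# ['Cars (Automobiles)', 'Automobiles', 'Automobiles, racing', 'Dragsters']
```

Every hop in that path is a navigation step the user must choose to take, which is the crux of Schiff's objection below.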

Adam L. Schiff, proposer of drag cars & drag racing cars, says below: “This just is not likely to happen.”

Question: Is there a relationship between users “work[ing] their way up and down hierarchies” and relationship display methods? Who chooses which items will be the starting point to lead to other items? How do you integrate a keyword search into such a system?

Question: And what of the full phrase/sentence AI systems where keywords work less well? How does that work with relationship display systems?

Question: I wonder whether relationship display methods are closer to the up-and-down hierarchies, but with less guidance.

Adam’s Dragster proposal post in full:


Automobiles has a UF Cars (Automobiles). Since the UF already exists on the basic heading, it is not necessary to add it to Dragsters. The proposal was not approved.

Our proposal was to add two additional cross-references to Dragsters: Drag cars, and Drag racing cars. While I understand, in principle, the reasoning behind the rejection of these additional references, I do not see how it serves users. A user coming to a catalog to search for the subject “Drag cars” will now get nothing, no redirection to the established heading. I don’t see how the presence of a reference from Cars (Automobiles) to Automobiles helps any user who starts a search with “Drag cars”. Only if they begin their search with Cars would they get led to Automobiles, and then only if they pursue narrower terms under that heading would they find Automobiles, Racing, which they would then have to follow further down to Dragsters. This just is not likely to happen. Instead they will probably start with a keyword search on “Drag cars” and find nothing, or if lucky, find one or two resources and think they have it all. And if they are astute enough to look at the subject headings on one of the records and see “Dragsters”, perhaps they will then redo their search.

Since the proposed cross-refs do not begin with the word Cars, I do not at all see how a decision like this is in the service of users of our catalogs. I think that LCSH rules for references were developed when it was expected that users would consult the big red books and work their way up and down hierarchies. While some online systems do provide for such navigation, it is doubtful that many users take this approach. Keyword searching is predominant in our catalogs and on the Web. Providing as many cross-refs to established headings as we can would be desirable. If the worry is that the printed red books will grow to too many volumes if we add more variant forms that weren’t made in the card environment, then perhaps there needs to be a way to include some references in authority records but mark them as not suitable for printing in printed products.
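The mechanics of Schiff's complaint can be sketched in a few lines. UF/see references form a simple redirect table from variant terms to established headings; the entries below are hypothetical, including the two that were proposed but not approved:

```python
# Hypothetical sketch of "used for" (UF) cross-references: a variant
# term redirects the user's search to the established heading, no
# matter which word the variant begins with.

uf_references = {
    # variant term        -> established heading
    "Cars (Automobiles)": "Automobiles",
    "Drag cars":          "Dragsters",   # proposed, not approved
    "Drag racing cars":   "Dragsters",   # proposed, not approved
}

def resolve(term):
    """Follow UF references until an established heading is reached."""
    while term in uf_references:
        term = uf_references[term]
    return term

print(resolve("Drag cars"))   # Dragsters
print(resolve("Dragsters"))   # Dragsters (already established)
```

Remove the two rejected entries and `resolve("Drag cars")` returns "Drag cars" unchanged, i.e. the dead-end keyword search Schiff describes.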

PS: According to ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz, UF has the following definition:

used for (UF)

A phrase indicating a term (or terms) synonymous with an authorized subject heading or descriptor, not used in cataloging or indexing to avoid scatter. In a subject headings list or thesaurus of controlled vocabulary, synonyms are given immediately following the official heading. In the alphabetical list of indexing terms, they are included as lead-in vocabulary followed by a see or USE cross-reference directing the user to the correct heading. See also: syndetic structure.

I did not attempt to reproduce the extremely rich cross-linking in this entry but commend the entire resource to your attention, particularly if you are a library science student.

Hard-Coding Bias in Google “Algorithmic” Search Results

Wednesday, November 17th, 2010

Hard-Coding Bias in Google “Algorithmic” Search Results.

Not that I want to get into analysis of hard-coding or not in search results but it is an interesting lead into issues a bit closer to home.

To what extent does subject identification have built-in biases that impact user communities?

Or less abstractly, how would we go about discovering and perhaps countering such bias?

For countering the bias you can guess that I would suggest topic maps. 😉

The more pressing question, and one that is relevant to topic map design, is: how do we discover our own biases?

What seems perfectly natural to me, with a background in law, biblical studies, networking technologies, markup technologies, and now semantic technologies, may not seem so to other users.

To make matters worse, how do you ask a user about information they did not find?


  1. How would you survey users to discover biases in subject identification? (3-5 pages, no citations)
  2. How would you discover what information users did not find? (3-5 pages, no citations)
  3. Class project: Design and test a survey for bias in a particular subject identification. (assuming permission from a library)

PS: There are biases in algorithms as well but we will cover those separately.

Subject World

Tuesday, May 11th, 2010

Subject World (Japanese only)

Subject World is a project to visualize heterogeneous terminology, including catalogs, for use with library catalogs. It uses BSH4 subject headings (Basic Subject Headings) and NDC9 index terms (Nippon Decimal Classification) to visualize and retrieve information from the Osaka City University OPAC.

English language resources:

Subject World: A System for Visualizing OPAC (paper)

Slides with the same title (but different publication from the paper):

Subject World: A System for Visualizing OPAC (slides)

See also: Murakami Harumi Laboratory, in particular its research and publication pages.

Subject Headings and Topic Maps

Monday, May 10th, 2010

Leveraging on prior work should be part of any topic map project.

Building topic maps with subject headings? See: Making topic maps from Subject Headings, a slide pack from Motomu Naito, a regular contributor in the topic maps community.

The project uses NDLSH 2008 (National Diet Library Subject Headings, 17,953 headings), BSH4 (Basic Subject Headings, Japanese Library Association, 7,847 headings), and LCSH (Library of Congress Subject Headings, 372,399 headings).

Slides describe organizing Wikipedia using subject headings, merging subjects with subject headings, and using LCSH subjects as bridges to map between subject headings in different languages.
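The bridge idea can be sketched as two lookups pivoting on a shared LCSH heading. The term pairings below are invented for illustration and do not come from the actual NDLSH or BSH4 lists:

```python
# Hypothetical sketch of using LCSH headings as a bridge between
# subject heading systems. The mappings are illustrative only.

ndlsh_to_lcsh = {"コンピュータ": "Computers"}   # hypothetical NDLSH term -> LCSH
bsh4_to_lcsh  = {"電子計算機": "Computers"}     # hypothetical BSH4 term -> LCSH

def bridge(term, source_map, target_map):
    """Map a term from one vocabulary to another via its LCSH heading."""
    lcsh = source_map.get(term)
    if lcsh is None:
        return None
    # Invert the target map to go from the LCSH heading back to the
    # target vocabulary's own term.
    inverse = {v: k for k, v in target_map.items()}
    return inverse.get(lcsh)

print(bridge("コンピュータ", ndlsh_to_lcsh, bsh4_to_lcsh))  # 電子計算機
```

Note the sketch assumes a one-to-one match at the LCSH pivot; real heading systems split and lump subjects differently, which is exactly where topic-based merging earns its keep.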

Forward to your local library researcher.

TFM (To Find Me) Scoring

Friday, April 9th, 2010

The TFM (To Find Me) score for a topic map or other information resource depends upon the subject being identified.

Here is a portion of a record from the Library of Congress:

LC Control No.: 2001376890
Type of Material: Book (Print, Microform, Electronic, etc.)
Main Title: Medieval Slavic manuscripts and SGML : problems and
perspectives = Srednovekovni slavi·a·nski rukopisi i
SGML / [Anisava Miltenova, David Birnbaum, editors].
Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
Published/Created: Sofii·a· : A.I. “Prof. Marin Drinov”, 2000.
Related Names: Miltenova, Anisava
Birnbaum, David J.
Description: 371 p. : ill. ; 24 cm.
ISBN: 9544307400
Subjects: ***omitted, will cover in another post***
LC Classification: Z115.5.C57 M43 2000
Language Code: eng bul
Other System No.: (OCoLC)ocm45819499
CALL NUMBER: Z115.5.C57 M43 2000

How many ways can you find this book?

  1. Main title: Medieval Slavic manuscripts and SGML : problems and perspectives
  2. Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
  3. ISBN: 9544307400
  4. Other System No.: (OCoLC)ocm45819499

TFM score of 4. Four ways to find this book.

But why weren’t the following included?

  1. LC Control No.: 2001376890
  2. CALL NUMBER: Z115.5.C57 M43 2000

Which would have made the TFM score 6.

Depends on what subject you think is being identified.

If the subject is this book, as a publication, the TFM score remains at 4.

If the subject is a particular copy of this book, held by the Library of Congress, the TFM score goes to 6.
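The scoring logic above is simple enough to sketch: count the record fields that identify the subject, where the set of qualifying fields depends on which subject you take the record to identify. The field groupings below follow the post's own reasoning and are otherwise an assumption:

```python
# Hypothetical sketch of a TFM ("To Find Me") score: count the record
# fields that identify the subject in question. Which fields count
# depends on whether the subject is the publication or the Library of
# Congress's particular copy of it.

record = {
    "Main Title": "Medieval Slavic manuscripts and SGML : problems and perspectives",
    "Parallel Title": "Srednovekovni slavi·a·nski rukopisi i SGML",
    "ISBN": "9544307400",
    "Other System No.": "(OCoLC)ocm45819499",
    "LC Control No.": "2001376890",
    "CALL NUMBER": "Z115.5.C57 M43 2000",
}

# Fields identifying the book as a publication, vs. those that also
# pin down the copy held by the Library of Congress.
publication_fields = {"Main Title", "Parallel Title", "ISBN", "Other System No."}
copy_fields = publication_fields | {"LC Control No.", "CALL NUMBER"}

def tfm_score(record, fields):
    """Count the non-empty fields that identify the chosen subject."""
    return sum(1 for f in fields if record.get(f))

print(tfm_score(record, publication_fields))  # 4 (the book as publication)
print(tfm_score(record, copy_fields))         # 6 (the LC's copy)
```

The score is only meaningful relative to a choice of subject, which is the point of the post.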

Contra Berman

Friday, March 19th, 2010

Sanford Berman has been a major figure in cataloging for decades. His Prejudices and Antipathies: A tract on the LC Subject Heads Concerning People might sound like a dull tract to be read by bored graduate students, but it’s not!

Published in 1971, this work criticizes Library of Congress subject headings. To give the flavor of Berman’s targets and his recommendations:

  • JEWISH QUESTION, “Remedy: Reconstructions are possible for many other inappropriate terms. Not, however, for this. It richly merits deletion.”
  • YELLOW PERIL, “Remedy: Cancel the head and ensure it does not re-appear even as a See referent to other forms.”
  • IDIOCY, IDIOT ASYLUMS, “Discard both ‘idiot’ forms completely….”

I don’t disagree with any of the changes that Berman recommends for current practice, but I do take issue with his recommendations for deletion.

Sanitizing our records will allow us to rest easy that we are beyond such categories as the “JEWISH QUESTION” or “YELLOW PERIL.” Except now we would say without any hesitation, the “MUSLIM QUESTION,” and “BROWN PERIL.” The latter when discussing immigration from Mexico on the Fox News network.

A well constructed topic map for subject headings should not hide our prior ignorance and prejudice from us. Lest we simply choose new victims in place of the old.

Subject Headings and the Semantic Web

Saturday, March 6th, 2010

One of the underlying (and false) presumptions of the Semantic Web is that users have a uniform understanding of the world. One that matches the understanding of ontology authors.

The failure of that presumption was demonstrated over a decade ago in rather remarkable research conducted by Karen Drabenstott (now Marley) on user understanding of Library of Congress subject headings.

Despite the use of Library of Congress subject headings for almost a century, no one before Drabenstott had asked the fundamental question: Does anyone understand Library of Congress subject headings? The study, Understanding Subject Headings in Library Catalogs, found that:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

The conclusions one would draw from such a result are easy to anticipate but I will quote from the report:

The developers of new indexing systems especially systems aimed at organizing the World-Wide Web should include children, adults, librarians, and even subject-matter experts in the establishment of new terms and changes to existing ones. Perhaps there should be separate indexing systems for children, adults, librarians, and subject-matter experts. With a click of a button, users could choose the indexing system that works for them in terms of their understanding of the subject matter and the indexing system’s terminology.

Hmmm, users “…choose the indexing system that works for them…,” what a remarkable concept. Topic maps anyone?
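The report's "click of a button" suggestion amounts to the same resources being indexed under audience-specific vocabularies, with the user choosing which index to search. A toy sketch, with invented terms and record identifiers:

```python
# Hypothetical sketch of audience-specific indexing systems: the same
# record is reachable under whichever vocabulary the user selects.
# All terms and record IDs below are invented for illustration.

indexes = {
    "children":   {"dogs": ["rec-1"]},
    "adults":     {"dogs": ["rec-1"], "canines": ["rec-1"]},
    "librarians": {"Dogs": ["rec-1"]},  # controlled heading form
}

def search(audience, term):
    """Search using the vocabulary chosen for the user's audience."""
    return indexes.get(audience, {}).get(term, [])

print(search("children", "dogs"))     # ['rec-1']
print(search("children", "canines"))  # [] - not in the children's vocabulary
```

A topic map would go one step further and merge the three vocabularies' entries on the underlying subject, so the choice of indexing system changes the labels shown, not the records found.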