Archive for the ‘LCSH’ Category

Constructing a true LCSH tree of a science and engineering collection

Monday, November 19th, 2012

Constructing a true LCSH tree of a science and engineering collection by Charles-Antoine Julien, Pierre Tirilly, John E. Leide and Catherine Guastavino.


The Library of Congress Subject Headings (LCSH) is a subject structure used to index large library collections throughout the world. Browsing a collection through LCSH is difficult using current online tools in part because users cannot explore the structure using their existing experience navigating file hierarchies on their hard drives. This is due to inconsistencies in the LCSH structure, which does not adhere to the specific rules defining tree structures. This article proposes a method to adapt the LCSH structure to reflect a real-world collection from the domain of science and engineering. This structure is transformed into a valid tree structure using an automatic process. The analysis of the resulting LCSH tree shows a large and complex structure. The analysis of the distribution of information within the LCSH tree reveals a power law distribution where the vast majority of subjects contain few information items and a few subjects contain the vast majority of the collection.

After a detailed analysis of 204,430 topical authority records from the McGill University Libraries and 130,940 bibliographic records from the Schulich Science and Engineering Library, the authors conclude in part:

This revealed that the structure was large, highly redundant due to multiple inheritances, very deep, and unbalanced. The complexity of the LCSH tree is a likely usability barrier for subject browsing and navigation of the information collection.
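To see why multiple inheritance makes a "true tree" large and redundant, here is a minimal Python sketch. It is not the authors' method: it assumes the simplest possible transformation, where a heading with several broader terms is copied once under each of them. The headings in the example are hypothetical.

```python
def unfold_to_tree(children, root):
    """Return a nested (label, [subtrees]) tree.

    A heading with several parents is copied under each parent,
    which is what makes the result a valid tree.
    """
    def build(node, path):
        if node in path:
            # Guard against cycles in the source data: stop descending.
            return (node, [])
        return (node, [build(c, path | {node}) for c in children.get(node, [])])

    return build(root, frozenset())


def count_nodes(tree):
    """Count every node in the unfolded tree, copies included."""
    _label, subtrees = tree
    return 1 + sum(count_nodes(t) for t in subtrees)


# Hypothetical headings: "Robotics" has two broader terms.
children = {
    "Science": ["Engineering", "Computer science"],
    "Engineering": ["Robotics"],
    "Computer science": ["Robotics"],
    "Robotics": ["Mobile robots"],
}

tree = unfold_to_tree(children, "Science")
distinct_headings = len(set(children) | {c for cs in children.values() for c in cs})
print(distinct_headings, count_nodes(tree))
```

Five distinct headings become seven tree nodes, because "Robotics" and everything beneath it appears twice. On a structure with hundreds of thousands of headings, that kind of duplication adds up quickly.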

For me the most compelling part of this research was the focus on LCSH as used and not as it imagines itself. Very interesting reading. A slow walk through the bibliography will interest those researching LCSH or classification more generally.

Demonstration of the power law with the use of LCSH makes one wonder about other classification systems as used.

The Correct End Of Your Telescope – Viewing Adoption

Sunday, November 4th, 2012

The Correct End Of Your Telescope – Viewing Adoption by Richard Wallis.


I have been banging on about Schema.org for a while. For those that have been lurking under a structured data rock for the last year, it is an initiative of cooperation between Google, Bing, Yahoo!, and Yandex to establish a vocabulary for embedding structured data in web pages to describe ‘things’ on the web. Apart from the simple significance of having those four names in the same sentence as the word cooperation, this initiative is starting to have some impact. As I reported back in June, the search engines are already seeing some 7%-10% of pages they crawl containing Schema.org markup. Like it or not, it is clear that Schema.org is rapidly becoming a de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.

It is no coincidence then, at OCLC we chose Schema.org as the way to expose linked data in WorldCat. If you haven’t seen it, just search for any item at WorldCat, scroll to the bottom of the page and open up the Linked Data tab and there you will see the [not very pretty, but hey it’s really designed for systems not humans] marked up linked data for the item, with links out to other data sources such as VIAF, LCSH, FAST, and Dewey.

Schema.org has much to recommend it, but I suspect that HTML remains the “…de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.”

Ten percent is no mean feat but it is still ten percent.

Downloadable Version of FAST Now Available

Thursday, February 23rd, 2012

Downloadable Version of FAST Now Available

Just in case you are in need of “an enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH).”

Thought that would get your attention. Details from the announcement follow:

OCLC Research has made FAST (Faceted Application of Subject Terminology) available for bulk download, along with some minor improvements based on user feedback and routine updates. As with other FAST data, the bulk downloadable versions are available at no charge.

FAST is an enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH). OCLC made FAST available as Linked Open Data in December 2011.

The bulk downloadable versions of FAST are offered at no charge. Like FAST content available through the FAST Experimental Linked Data Service, the downloadable versions of FAST are made available under the Open Data Commons Attribution (ODC-By) license.

FAST may be downloaded in either SKOS/RDF format or MARC XML (Authorities format). Users may download the entire FAST file including all eight facets (Personal Names, Corporate Names, Event, Uniform Titles, Chronological, Topical, Geographic, Form/Genre) or choose to download individual facets (see the download information page for more details).

OCLC has enhanced the VoID (“Vocabulary of Interlinked Datasets”) dataset description for improved ease of processing of the license references. Several additions and changes to FAST headings have been made in the normal course of processing new and changed headings in LCSH. OCLC will continue to periodically update FAST based on new and changed headings in LCSH.

About FAST

The FAST authority file, which underlies the FAST Linked Data release, has been created through a multi-year collaboration of OCLC Research and the Library of Congress. Specifically, it is designed to make the rich LCSH vocabulary available as a post-coordinate system in a Web environment. For more information, see the FAST activity page.

…Library of Congress Subject Heading for Social Tags

Monday, August 2nd, 2010

“A Semantic Similarity Approach for Predicting Library of Congress Subject Headings for Social Tags,” by Kwan Yi, appears in JASIST, 61(8):1658-1672, 2010. This is an important article for library students to read. Carefully.

The author recognizes that linking social tags to controlled vocabularies may help with the organization of information that is only socially tagged. And the article is a good review of the application of five popular semantic similarity measures.

The more interesting step would be the reverse of what the author suggests: “The study of introducing the LCSH to give a control to social tags…” (p. 1670).

Why not introduce “social tags” to enrich the finding experience of users in LCSH settings?

A substantial body of users find information with “social tags,” so why not offer that option?

The user experience with “social tags” alongside LCSH headings in a library setting awaits future research.

Authorities and Vocabularies!

Wednesday, June 23rd, 2010

Authorities and Vocabularies at the Library of Congress offers bulk downloads of some of its authorities and vocabularies. Like the Library of Congress subject headings!

Granted, it is in RDF, but your topic map application is going to encounter RDF eventually. You may as well develop some experience at incorporating it into your topic map as you would any other subject identification system.

Subject World

Tuesday, May 11th, 2010

Subject World (Japanese only)

Subject World is a project to visualize heterogeneous terminology, including catalogs, for use with library catalogs. It uses BSH4 subject headings (Basic Subject Headings) and NDC9 index terms (Nippon Decimal Classification) to visualize and retrieve information from the Osaka City University OPAC.

English language resources:

Subject World: A System for Visualizing OPAC (paper)

Slides with the same title (but different publication from the paper):

Subject World: A System for Visualizing OPAC (slides)

See also: Murakami Harumi Laboratory, in particular its research and publication pages.

Subject Headings and Topic Maps

Monday, May 10th, 2010

Leveraging prior work should be part of any topic map project.

Building topic maps with subject headings? See: Making topic maps from Subject Headings, a slide pack from Motomu Naito, a regular contributor in the topic maps community.

The project uses NDLSH 2008 (National Diet Library Subject Headings, 17,953 subject headings), BSH4 (Basic Subject Headings, Japanese Library Association, 7,847 subject headings), and LCSH (Library of Congress Subject Headings, 372,399 subject headings).

The slides describe organizing Wikipedia using subject headings, merging subjects with subject headings, and using LCSH subjects as bridges to map between subject headings in different languages.

Forward to your local library researcher.

TFM (To Find Me) Scoring

Friday, April 9th, 2010

The TFM (To Find Me) score for a topic map or other information resource depends upon the subject being identified.

Here is a portion of a record from the Library of Congress:

LC Control No.: 2001376890
Type of Material: Book (Print, Microform, Electronic, etc.)
Main Title: Medieval Slavic manuscripts and SGML : problems and
    perspectives = Srednovekovni slavi·a·nski rukopisi i
    SGML / [Anisava Miltenova, David Birnbaum, editors].
Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
Published/Created: Sofii·a· : A.I. “Prof. Marin Drinov”, 2000.
Related Names: Miltenova, Anisava
    Birnbaum, David J.
Description: 371 p. : ill. ; 24 cm.
ISBN: 9544307400
Subjects: ***omitted, will cover in another post***
LC Classification: Z115.5.C57 M43 2000
Language Code: eng bul
Other System No.: (OCoLC)ocm45819499
CALL NUMBER: Z115.5.C57 M43 2000

How many ways can you find this book?

  1. Main title: Medieval Slavic manuscripts and SGML : problems and perspectives
  2. Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
  3. ISBN: 9544307400
  4. Other System No.: (OCoLC)ocm45819499

TFM score of 4. Four ways to find this book.

But why weren’t the following included?

  1. LC Control No.: 2001376890
  2. CALL NUMBER: Z115.5.C57 M43 2000

Which would have made the TFM score 6.

Depends on what subject you think is being identified.

If the subject is this book, as a publication, the TFM score remains at 4.

If the subject is a particular copy of this book, held by the Library of Congress, the TFM score goes to 6.
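The dependence of the score on the subject can be sketched in a few lines of Python. The field values come from the record above. The grouping of fields into "identifies the publication" and "identifies this copy", and the `tfm_score` function itself, are assumptions made for illustration, not an established TFM definition.

```python
# Field values copied from the Library of Congress record above.
RECORD = {
    "Main Title": "Medieval Slavic manuscripts and SGML : problems and perspectives",
    "Parallel Title": "Srednovekovni slavi·a·nski rukopisi i SGML",
    "ISBN": "9544307400",
    "Other System No.": "(OCoLC)ocm45819499",
    "LC Control No.": "2001376890",
    "CALL NUMBER": "Z115.5.C57 M43 2000",
}

# Assumption: which fields identify which kind of subject.
PUBLICATION_FIELDS = ["Main Title", "Parallel Title", "ISBN", "Other System No."]
COPY_FIELDS = PUBLICATION_FIELDS + ["LC Control No.", "CALL NUMBER"]


def tfm_score(record, identifying_fields):
    """Count the identifying fields that are actually present in the record."""
    return sum(1 for field in identifying_fields if record.get(field))


publication_score = tfm_score(RECORD, PUBLICATION_FIELDS)
copy_score = tfm_score(RECORD, COPY_FIELDS)
print(publication_score, copy_score)
```

Same record, two scores: the only thing that changed is the answer to "what subject is being identified?"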

TFM (To Find Me) Mark Twain

Thursday, April 8th, 2010

My TFM (To Find Me) project for today is the Library of Congress catalog and the subject is “Mark Twain.” I started at the catalog’s search page, selected “Author Keyword,” and entered “Mark Twain.” Putting in the exact string is a TFM score of 1 but I had to start somewhere.

Results? 36 results in total: 6 personal names, 7 meeting names, and 23 corporate names. Since I am interested in the subject, the author “Mark Twain,” let’s look a bit closer at the returns. The returns include the number of “titles” for each listing, thus the first one is 1 title by “David, Mark Twain.”

  • 1 David, Mark Twain
  • 1 Nadir, Mark Twain, 1913-
  • 17 Twain, Mark.
  • 1438 Twain, Mark, 1835-1910
  • 1 Twain, Mark, 1835-1910 (Spirit)
  • Twain, Mark Mrs., 1845-1904

The fourth entry, “1438 Twain, Mark, 1835-1910” has a more info logo and if we follow that we find: “see also: Clemens, Samuel Langhorne, 1835-1910.” If we follow that, we get:

  • 9 Clemens, Samuel Langhorne, 1835-1910

There is a more info link with a pointer to “Twain, Mark, 1835-1910” at this result.

As it stands now, we have a TFM score of 2 on the subject of Mark Twain (Exact string, Mark Twain and Clemens, Samuel Langhorne). I am curious about the entry with 1438 titles since I am sure that Twain’s literary output was less than that number. Note that “A Connecticut Yankee in King Arthur’s Court” does not appear in the listing of the works by Twain in the third line item. Clearly something is amiss.
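Following “see also” pointers by hand is a small graph walk, and it can be sketched in Python. The two headings and the pointers between them are the ones reported above. The `ways_to_find` function and the dictionary layout are hypothetical, and the sketch assumes the heading matched by the search string counts as the first way to find the subject.

```python
def ways_to_find(start, see_also):
    """Collect every heading reachable from `start` through see-also pointers."""
    seen, stack = set(), [start]
    while stack:
        heading = stack.pop()
        if heading in seen:
            continue  # see-also pointers often point back, so skip repeats
        seen.add(heading)
        stack.extend(see_also.get(heading, []))
    return seen


# The two pointers found in the catalog, one in each direction.
SEE_ALSO = {
    "Twain, Mark, 1835-1910": ["Clemens, Samuel Langhorne, 1835-1910"],
    "Clemens, Samuel Langhorne, 1835-1910": ["Twain, Mark, 1835-1910"],
}

names = ways_to_find("Twain, Mark, 1835-1910", SEE_ALSO)
print(len(names), sorted(names))
```

The walk stops at two headings, which matches the score of 2 reached by clicking through the catalog.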

Contra Berman

Friday, March 19th, 2010

Sanford Berman has been a major figure in cataloging for decades. His Prejudices and Antipathies: A tract on the LC Subject Heads Concerning People might sound like a dull tract to be read by bored graduate students, but it’s not!

Published in 1971, this work criticizes Library of Congress subject headings. To give the flavor of Berman’s targets and his recommendations:

  • JEWISH QUESTION, “Remedy: Reconstructions are possible for many other inappropriate terms. Not, however, for this. It richly merits deletion.”
  • YELLOW PERIL, “Remedy: Cancel the head and ensure it does not re-appear even as a See referent to other forms.”
  • IDIOCY, IDIOT ASYLUMS, “Discard both ‘idiot’ forms completely….”

I don’t disagree with any of the changes that Berman recommends for current practice, but I do take issue with his recommendations for deletion.

Sanitizing our records will allow us to rest easy that we are beyond such categories as the “JEWISH QUESTION” or “YELLOW PERIL.” Except now we would say, without any hesitation, the “MUSLIM QUESTION” and “BROWN PERIL.” The latter when discussing immigration from Mexico on the Fox News network.

A well-constructed topic map for subject headings should not hide our prior ignorance and prejudice from us. Lest we simply choose new victims in place of the old.

Subject Headings and the Semantic Web

Saturday, March 6th, 2010

One of the underlying (and false) presumptions of the Semantic Web is that users have a uniform understanding of the world. One that matches the understanding of ontology authors.

The failure of that presumption was demonstrated over a decade ago in rather remarkable research conducted by Karen Drabenstott (now Markey) on user understanding of Library of Congress subject headings.

Despite the use of Library of Congress subject headings for almost a century, no one before Drabenstott had asked the fundamental question: Does anyone understand Library of Congress subject headings? The study, Understanding Subject Headings in Library Catalogs, found that:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

The conclusions one would draw from such a result are easy to anticipate but I will quote from the report:

The developers of new indexing systems especially systems aimed at organizing the World-Wide Web should include children, adults, librarians, and even subject-matter experts in the establishment of new terms and changes to existing ones. Perhaps there should be separate indexing systems for children, adults, librarians, and subject-matter experts. With a click of a button, users could choose the indexing system that works for them in terms of their understanding of the subject matter and the indexing system’s terminology.

Hmmm, users “…choose the indexing system that works for them…,” what a remarkable concept. Topic maps anyone?