Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 26, 2010

Semantic Compression

Filed under: Cataloging,Indexing,Semantic Diversity — Patrick Durusau @ 12:55 pm

It isn’t difficult to find indexing terms to represent documents.

But whatever indexing terms are used, a large portion of the relevant documents, as much as 80%, will go unfound. See Size Really Does Matter… (A study of full text searching, but the underlying problem is the same: “What term was used?”)

You read a document, are familiar with its author, concepts, literature it cites, the relationships of that literature to the document and the relationships between the ideas in the document. Now you have to choose one or more terms to represent all the semantics and semantic relationships in the document. The exercise you are engaged in is compressing the semantics in a document into one or more terms.

Unlike data compression, a la Shannon, the semantic compression algorithm used by any user is unknown. We know it isn’t possible to decompress an indexing term to recover all the semantics of a document it purports to represent. Since a term is used to represent several documents, the problem is even worse. We would have to decompress the term to recover the semantics of all the documents it represents.
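To make the one-way nature of that compression concrete, here is a minimal sketch. Everything in it is made up for illustration: the document ids, the "semantic features" and the assigned terms are hypothetical, and real document semantics are far richer than a set of strings. The point is only that going from a term back to documents yields a pile of candidates, not the semantics of any one of them.

```python
# Hypothetical documents, each reduced to a few "semantic features".
# (An assumption for illustration; real semantics do not reduce to a set of strings.)
documents = {
    "doc-1": {"topic maps", "subject identity", "merging"},
    "doc-2": {"topic maps", "XML syntax", "interchange"},
    "doc-3": {"back-of-book indexes", "see also references"},
}

# Hypothetical indexer choices: one term per document.
assigned_term = {
    "doc-1": "topic maps",
    "doc-2": "topic maps",
    "doc-3": "indexing",
}


def decompress(term):
    """Return every feature set the term might stand for.

    This is the best "decompression" available: the term alone cannot
    tell us which document, or which of its features, the indexer meant.
    """
    return {
        doc: documents[doc]
        for doc, assigned in assigned_term.items()
        if assigned == term
    }


candidates = decompress("topic maps")
for doc, features in sorted(candidates.items()):
    print(doc, sorted(features))
```

Two quite different documents come back for the same term, which is the "even worse" case described above.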

Even without the algorithm used to assign indexing (or tagging) terms, investigation of semantic compression could be useful. For example, one could encode the semantics of a set of documents (to a set depth) and then ask groups of users to assign those documents indexing or tagging terms. By varying the semantics in the documents, it may, emphasis on may, be possible to experimentally derive partial semantic decompression for some terms and classes of users.

June 7, 2010

The Value of Indexing

Filed under: Citation Indexing,Indexing,Subject Identity — Patrick Durusau @ 8:46 am

The Value of Indexing (2001) by Jan Sykes is a promotion piece for Factiva, a Dow Jones and Reuters Company, but is also a good overview of the value of indexing.

I find it interesting in its description of the use of a taxonomy for indexing purposes. You may remember from reading a print index the use of the term “see also.” This paper appears to argue that the indexing process consists of mapping one or more terms to a single term in the controlled vocabulary.

A single entry from the controlled vocabulary represents a particular concept no matter how it was referred to in the original article. (page 5)

I assume the mapping between the terms in the article and the term in the controlled vocabulary is documented. That mapping may be of more interest to the professionals who create the indexes and to power users than to the typical user.
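A minimal sketch of what such a documented mapping could look like, assuming a simple lookup table. The vocabulary entries and variant terms below are hypothetical and are not taken from the Factiva paper; a production taxonomy would be far larger and would handle ambiguity, which this does not.

```python
# Hypothetical mapping from terms as they appear in articles to a single
# controlled-vocabulary entry. None of these entries come from Factiva.
CONTROLLED_VOCABULARY = {
    "m&a": "Mergers and Acquisitions",
    "merger": "Mergers and Acquisitions",
    "takeover": "Mergers and Acquisitions",
    "acquisition": "Mergers and Acquisitions",
    "layoffs": "Workforce Reductions",
    "downsizing": "Workforce Reductions",
}


def index_terms(article_terms):
    """Return the controlled-vocabulary entries for an article's terms.

    Terms are normalized (trimmed, lower-cased) before lookup; terms with
    no entry in the mapping are silently skipped.
    """
    entries = set()
    for term in article_terms:
        entry = CONTROLLED_VOCABULARY.get(term.strip().lower())
        if entry is not None:
            entries.add(entry)
    return entries


print(index_terms(["Takeover", "merger", "quarterly results"]))
```

However the article phrased it, the reader of the index sees one entry. The table itself is the part that stays behind the scenes.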

Perhaps that is a lesson in terms of what is presented to users of topic maps.

Delivery of the information a user wants/needs in their context is more important than demonstrating our cleverness.

That was one of the mistakes in promoting markup: too much emphasis on the cool, new and paradigm shifting, and too little emphasis on the benefit to users. Now that office products use markup in a manner invisible to the average user, markup usage has spread rapidly around the world.

Suggestions on how to make that happen for topic maps?

PS: Obviously this is an old piece so in fairness I am contacting Factiva to advise them of this post and to ask if they have an updated paper, etc. that they might want me to post. I will take the opportunity to plug topic maps as well. 😉

June 6, 2010

Citation Indexing – Semantic Diversity – Exercise

Filed under: Citation Indexing,Exercises,Indexing,Semantic Diversity — Patrick Durusau @ 10:48 am

In A Conceptual View of Citation Indexing, which is chapter 1 of Citation Indexing — Its Theory and Application in Science, Technology, and Humanities (1979), Garfield says of the problem of changing terminology and semantics:

Citations, used as indexing statements, provide these lost measures of search simplicity, productivity, and efficiency by avoiding the semantics problems. For example, suppose you want information on the physics of simple fluids. The simple citation “Fisher, M.E., Math. Phys., 5,944, 1964” would lead the searcher directly to a list of papers that have cited this important paper on the subject. Experience has shown that a significant percentage of the citing papers are likely to be relevant. There is no need for the searcher to decide which subject terms an indexer would be most likely to use to describe the relevant papers. The language habits of the searcher would not affect the search results, nor would any changes in scientific terminology that took place since the Fisher paper was published.

In other words, the citation is a precise, unambiguous representation of a subject that requires no interpretation and is immune to changes in terminology. In addition, the citation will retain its precision over time. It also can be used in documents written in different languages. The importance of this semantic stability and precision to the search process is best demonstrated by a series of examples.
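The lookup Garfield describes can be sketched as an inverted index from a cited reference to the papers citing it. This is only an illustration under stated assumptions: the paper ids and the second reference are hypothetical, the function name is mine, and matching is by exact citation string, which real citation indexes cannot rely on.

```python
from collections import defaultdict


def build_citation_index(papers):
    """Invert a mapping of citing paper -> cited references.

    Returns a mapping of cited reference -> set of citing papers.
    Matching is by exact string, an assumption made for simplicity.
    """
    index = defaultdict(set)
    for citing_paper, references in papers.items():
        for reference in references:
            index[reference].add(citing_paper)
    return index


# The citation string is the one from Garfield's example; the citing
# papers and the other reference are hypothetical.
FISHER = "Fisher, M.E., Math. Phys., 5,944, 1964"
papers = {
    "paper-A": [FISHER, "hypothetical reference X"],
    "paper-B": [FISHER],
    "paper-C": ["hypothetical reference X"],
}

index = build_citation_index(papers)
citing = sorted(index[FISHER])
print(citing)
```

Note what the index does not record: why paper-A or paper-B cited Fisher. That is where the questions and the exercise that follow come in.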

Question: What subject does a citation represent?

Question: What “precision” does the citation retain over time?

Exercise: Select any article that interests you with more than twenty (20) non-self citations. Identify ten (10) ideas in the article and examine at least twenty (20) citing articles. Why was your article cited? Was your article cited for an idea you identified? Was your article cited for an idea you did not identify? (Either one is correct. This is not a test of guessing why an article will be cited. It is exploration of a problem space. Your fact finding is important.)

Extra credit: Did you notice any evidence to support or contradict the notion that citation indexing avoids the issue of semantic diversity? If your article has been cited for more than ten (10) years, try one or two citations per year for every year it is cited. Again, your factual observations are important.

Citation Indexing

Eugene Garfield’s homepage may not be familiar to topic map fans but it should be.

Garfield invented citation indexing in the late 1950s/early 1960s.

Among the treasures you will find here:

June 4, 2010

representing scientific discourse, or: why triples are not enough

Filed under: Classification,Indexing,Information Retrieval,Ontology,RDF,Semantic Web — Patrick Durusau @ 4:15 pm

representing scientific discourse, or: why triples are not enough by Anita de Waard, Disruptive Technologies Director (how is that for a cool title?), Elsevier Labs, merits a long look.

I won’t spoil the effect by trying to summarize the presentation. It is only 23 slides long.

Read those slides carefully and then get yourself to: Rhetorical Document Structure Group HCLS IG W3C. Read, discuss, contribute.

PS: Based on this slide pack I am seriously thinking of getting a Twitter account so I can follow Anita. Not saying I will but am as tempted as I have ever been. This looks very interesting. Fertile ground for discussion of topic maps.

Hadoop-HBase-Lucene-Mahout-Nutch-Solr Digests

Filed under: Indexing,MapReduce,Search Engines,Software — Patrick Durusau @ 5:40 am

More interests than time?

Digests of developments in May 2010:

Hadoop

HBase

Lucene

Mahout

Nutch

Solr

Suggestions of other digest type sources and/or comments on such sources deeply appreciated.

May 15, 2010

Semantic Indexing

Filed under: Authoring Topic Maps,Indexing,Information Retrieval,Semantics — Patrick Durusau @ 6:41 pm

Semantic indexing and searching using a Hopfield net

Automatic creation of thesauri as a means of dealing with the “vocabulary problem.”

Another topic map construction tool.

A bit dated (1997), but I will run this line of research forward and report back.

With explicit subject identity, machine generated thesauri could be reliably interchanged.

And improved upon by human users.

May 14, 2010

Society of Indexers

Filed under: Indexing — Patrick Durusau @ 12:38 pm

Society of Indexers.

British and Irish professional body for indexing.

Some highlights:

James Lamb’s “Human or computer produced indexes?” (Reads like a one page brief for topic maps.)

Valerie A. Elliston’s “Indexing children’s books”

How to become an indexer: resources, pointers, distance-learning courses, etc.

The Indexer

Filed under: Indexing — Patrick Durusau @ 11:23 am

The Indexer: The International Journal of Indexing

Subscription (with The American Society for Indexing membership) is an affordable £28.

Public access to Indexer back issues dating from 2006 – 1958!

Subject recognition lies at the heart of indexing. The same can be said for topic maps, with the addition of making the subjects recognized explicit for use and reuse of others. This is an opportunity to study with experts on subject recognition.

Not a dull publication! Indexes Reviewed, April 2005, has “Indexes praised,” “Indexes censured” (highly amusing), and comments on the index to “My Life” by Bill Clinton.

The American Society for Indexing

Filed under: Indexing — Patrick Durusau @ 7:49 am

The American Society for Indexing

Topic maps arose from the ashes of an indexing project.

What better place to look for resources than an indexing society?

Journal: Key Words. The sampling of articles at Key Words: Sample Articles has me reaching for my plastic to get a membership!

Special Interest Groups (Sam Hunting: Note the Culinary SIG)

Publications, resources, etc. Check it out!

(Thanks to Christopher Courington, a former student from UIUC, for reminding me of their journal and site.)

March 6, 2010

Subject Headings and the Semantic Web

Filed under: Indexing,LCSH,Subject Headings — Patrick Durusau @ 5:08 pm

One of the underlying (and false) presumptions of the Semantic Web is that users have a uniform understanding of the world. One that matches the understanding of ontology authors.

The failure of that presumption was demonstrated over a decade ago in rather remarkable research conducted by Karen Drabenstott (now Marley) on user understanding of Library of Congress subject headings.

Despite the use of Library of Congress subject headings for almost a century, no one before Drabenstott had asked the fundamental question: Does anyone understand Library of Congress subject headings? The study, Understanding Subject Headings in Library Catalogs, found that:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

The conclusions one would draw from such a result are easy to anticipate but I will quote from the report:

The developers of new indexing systems especially systems aimed at organizing the World-Wide Web should include children, adults, librarians, and even subject-matter experts in the establishment of new terms and changes to existing ones. Perhaps there should be separate indexing systems for children, adults, librarians, and subject-matter experts. With a click of a button, users could choose the indexing system that works for them in terms of their understanding of the subject matter and the indexing system’s terminology.

Hmmm, users “…choose the indexing system that works for them…,” what a remarkable concept. Topic maps anyone?
