Archive for the ‘Tagging’ Category

Robotic Article Tagging (OpenOffice Idea)

Friday, July 31st, 2015

The New York Times built a robot to help make article tagging easier by Justin Ellis.

From the post:

If you write online, you know that a final, tedious part of the process is adding tags to your story before sending it out to the wider world.

Tags and keywords in articles help readers dig deeper into related stories and topics, and give search audiences another way to discover stories. A Nieman Lab reader could go down a rabbit hole of tags, finding all our stories mentioning Snapchat, Nick Denton, or Mystery Science Theater 3000.

Those tags can also help newsrooms create new products and find inventive ways of collecting content. That’s one reason The New York Times Research and Development lab is experimenting with a new tool that automates the tagging process using machine learning — and does it in real time.

The Times R&D Editor tool analyzes text as it’s written and suggests tags along the way, in much the way that spell-check tools highlight misspelled words:

Great post but why not take the “…in much the way that spell-check tools highlight misspelled words” just a step further?

Apache OpenOffice already has spell-checking, so why not improve it to have automatic tagging?

You may or may not know that Open Document Format (ODF) 1.2 was just published as an ISO standard!

Which is the format used by Apache OpenOffice.

Open Document Format (ODF) 1.2 supports RDFa for inline metadata.

Now, imagine for a moment using standard office suite software (Apache OpenOffice) to choose a metadata dictionary and have your content automatically tagged as you type, or to insert a document and have tags automatically inserted into its text.
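A minimal sketch of that idea, assuming a tiny hand-made dictionary. The terms, the example.org IRIs and the function name are all hypothetical, and the output is generic RDFa-style markup, not the exact XML that ODF 1.2 uses for inline metadata:

```python
import re

# Hypothetical metadata dictionary: surface term -> identifier (IRI).
# Both the terms and the IRIs are invented for illustration.
TAG_DICTIONARY = {
    "New York Times": "http://example.org/org/nyt",
    "OpenOffice": "http://example.org/software/openoffice",
}

def tag_text(text, dictionary):
    """Wrap every dictionary term found in text in an RDFa-style span."""
    if not dictionary:
        return text
    # Longest terms first, so a long term wins over a shorter one it contains.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    )

    def wrap(match):
        term = match.group(0)
        return '<span about="{}">{}</span>'.format(dictionary[term], term)

    return pattern.sub(wrap, text)

print(tag_text("Apache OpenOffice already has spell-checking.", TAG_DICTIONARY))
```

An editor plugin would run something like this on each paragraph as it is typed, much as a spell-checker does, and let the author accept or reject each suggested tag.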

Does that sound like a killer application for your corner of the woods?

A universal dictionary of RDFa tags might be a real memory hog but how many different tags would you need day to day? That’s even an empirical question that could be answered by indexing your documents for the past six (6) months.
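That empirical question could be approached along these lines. This is only a sketch: the function name, the sample documents and the candidate terms are assumptions, and real documents would first need their text extracted from the office format.

```python
import re
from collections import Counter

def tag_usage(documents, dictionary_terms):
    """Count how often each candidate dictionary term occurs in the documents."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for term in dictionary_terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            hits = len(re.findall(pattern, lowered))
            if hits:
                counts[term] += hits
    return counts

# Hypothetical stand-ins for six months of documents.
documents = [
    "Solr and Lucene power the search page.",
    "Lucene again: the Lucene index was rebuilt.",
]
usage = tag_usage(documents, ["Lucene", "Solr", "RDFa"])
print(len(usage), "of 3 candidate tags were actually used")
```

Terms that never show up in the count are candidates for leaving out of the working dictionary.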

With very little effort on the part of users, you can transform your documents from unstructured text to tagged (and proofed) text.

Assemble at the Apache OpenOffice (or LibreOffice) projects if an easy-to-use, easy-to-modify tagging system for office suite software appeals to you.

For other software projects supporting ODF, see: OpenDocument software.

PS: Work is currently underway at the ODF TC (OASIS) on robust change tracking support. All we are missing is you.

Why news organizations need to invest in better linking and tagging

Saturday, September 20th, 2014

Why news organizations need to invest in better linking and tagging by Frédéric Filloux.

From the post:

Most media organizations are still stuck in version 1.0 of linking. When they produce content, they assign tags and links mostly to other internal content. This is done out of fear that readers would escape for good if doors were opened too wide. Assigning tags is not exact science: I recently spotted a story about the new pregnancy in the British royal family; it was tagged “demography,” as if it was some piece about Germany’s weak fertility rate.

But there is much more to come in that field. Two factors are at work: APIs and semantic improvements. APIs (Application Programming Interfaces) act like the receptors of a cell that exchanges chemical signals with other cells. It’s the way to connect a wide variety of content to the outside world. A story, a video, a graph can “talk” to and be read by other publications, databases, and other “organisms.” But first, it has to pass through semantic filters. From a text, the most basic tools extract sets of words and expressions such as named entities, patronyms, places.

Another higher level involves extracting meanings like “X acquired Y for Z million dollars” or “X has been appointed finance minister.” But what about a video? Some go with granular tagging systems; others, such as Ted Talks, come with multilingual transcripts that provide valuable raw material for semantic analysis. But the bulk of content remains stuck in a dumb form: minimal and most often unstructured tagging. These require complex treatments to make them “readable” by the outside world. For instance, an untranscribed video seen as interesting (say a Charlie Rose interview) will have to undergo a speech-to-text analysis to become usable. This process requires both human curation (finding out what content is worth processing) and sophisticated technology (transcribing a speech by someone speaking super-fast or with a strong accent.)

Great piece on the value of more robust tagging by news organizations.

Rather than tagging as an after-the-fact of publication activity, tagging needs to be part of the work flow that produces content. Tagging as a step in the process of content production avoids creating a mountain of untagged content.

To what end? Well, imagine simple tagging that associates a reporter with named sources in a report. When the subject of that report comes up in the future, wouldn’t it be a time saver to whistle up all the reporters on that subject with a list of their named contacts?

Never having worked at a newspaper, I can’t say for sure, but to an outsider that sounds like an advantage.

That lesson can be broadened to any company producing content. The data in the content had a point of origin: it was delivered by someone, reported by someone else, etc. Capture those relationships and track the ebb and flow of your data and not just the values it represents.
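The reporter-and-sources idea can be sketched as a small index. Everything here is hypothetical (the field names, the sample stories, the function names); it only shows that capturing the relationship at writing time makes the later lookup trivial.

```python
from collections import defaultdict

def build_contact_index(stories):
    """Map subject -> reporter -> set of named sources, from tagged stories."""
    index = defaultdict(lambda: defaultdict(set))
    for story in stories:
        for subject in story["subjects"]:
            index[subject][story["reporter"]].update(story["sources"])
    return index

def contacts_for(index, subject):
    """Return every reporter who covered the subject, with their named sources."""
    reporters = index.get(subject, {})
    return {reporter: sorted(sources) for reporter, sources in reporters.items()}

# Hypothetical tagged stories.
stories = [
    {"reporter": "Reporter A", "subjects": ["city budget"],
     "sources": ["Source X", "Source Y"]},
    {"reporter": "Reporter B", "subjects": ["city budget", "transit"],
     "sources": ["Source Z"]},
]
index = build_contact_index(stories)
print(contacts_for(index, "city budget"))
```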

I first saw this in a tweet by Marin Dimitrov.

Which gene did you mean?

Wednesday, July 16th, 2014

Which gene did you mean? by Barend Mons.


Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to find out. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.

Other suggestions for BMC Bioinformatics?

SAMUELS [English Historical Semantic Tagger]

Wednesday, July 9th, 2014

SAMUELS (Semantic Annotation and Mark-Up for Enhancing Lexical Searches)

From the webpage:

The SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches) is funded by the Arts and Humanities Research Council in conjunction with the Economic and Social Research Council (grant reference AH/L010062/1) from January 2014 to April 2015. It will deliver a system for automatically annotating words in texts with their precise meanings, disambiguating between possible meanings of the same word, ultimately enabling a step-change in the way we deal with large textual data. It uses the Historical Thesaurus of English as its core dataset, and will provide for each word in a text the Historical Thesaurus reference code for that concept. Textual data tagged in this way can then be accurately searched and precisely investigated, producing results which can be automatically aggregated at a range of levels of precision. The project also draws on a series of research sub-projects which will employ the software thus developed, testing and validating the utility of the SAMUELS tagger as a tool for wide-ranging further research.

To really appreciate this project, visit SAMUELS English Semantic Tagger Test Site.

There you can enter up to 2000 English words and select lower/upper year boundaries!

Just picking a text at random, ;-), I chose:

Greenpeace flew its 135-foot-long thermal airship over the Bluffdale, UT, data center early Friday morning, carrying the message: “NSA Illegal Spying Below” along with a link steering people to a new web site,, which the three groups launched with the support of a separate, diverse coalition of over 20 grassroots advocacy groups and Internet companies. The site grades members of Congress on what they have done, or often not done, to rein in the NSA.

Some terms and Semtag3 by time period:


  • congress: C09d01 [Sexual intercourse]; E07e16 [Inclination]; E08e12 [Movement towards a thing/person/position]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; E07e16 [Inclination];
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: H55a [Attestation, witness, evidence];
  • thermal: A04b02 [Spring]; C09a [Sexual desire]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B06d01 [Deformities of specific parts]; B25d [Tools and implements];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: F04v04 [Data]; H55a [Attestation, witness, evidence]; W05 [Information];
  • thermal: A04b02 [Spring]; B28b [Types/styles of clothing]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06d01 [Deformities of specific parts]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: 04.10[Unrecognised]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: 04.10[Unrecognised]

I am assuming that the “04.10[Unrecognised]” for all terms in 2000-2014 means there is no usage data for that time period.

I have never heard anyone deny that meanings of words change over time and domain.

What remains a mystery is why the value-add of documenting the meanings of words isn’t obvious?

I say “words,” but I should be saying “data.” Remember the loss of the $125 million Mars Climate Orbiter: one system read a value as “pounds of force” and another read the same data as “newtons.” In that scenario, ET doesn’t get to call home.

So let’s rephrase the question to: Why isn’t the value-add of documenting the semantics of data obvious?
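As a small illustration of what documenting the semantics of data buys you, here is a sketch in which a value never travels without its unit. The class and its names are hypothetical; the conversion factor is the standard definition of the pound-force in newtons.

```python
from dataclasses import dataclass

# One pound-force expressed in newtons (exact by definition).
NEWTONS_PER_LBF = 4.4482216152605

@dataclass(frozen=True)
class Force:
    value: float
    unit: str  # the documented semantics of the number: "N" or "lbf"

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * NEWTONS_PER_LBF
        # An undocumented unit is an error, not something to guess at.
        raise ValueError(f"unit not documented: {self.unit!r}")

reading = Force(10.0, "lbf")
print(reading.to_newtons())
```

A bare `10.0` can be read either way; a `Force(10.0, "lbf")` cannot.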


Online Language Taggers

Tuesday, May 13th, 2014

UCREL Semantic Analysis System (USAS)

From the homepage:

The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects and this page collects together various pointers to those projects and publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur’s Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (shown here on the right), subdivided, and with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.

There are four online taggers available:

English: 100,000 word limit

Italian: 2,000 word limit

Dutch: 2,000 word limit

Chinese: 3,000 character limit


I first saw this in a tweet by Paul Rayson.

Semantic Computing of Moods…

Friday, August 16th, 2013

Semantic Computing of Moods Based on Tags in Social Media of Music by Pasi Saari, Tuomas Eerola. (IEEE Transactions on Knowledge and Data Engineering, 2013: 1. DOI: 10.1109/TKDE.2013.128)


Social tags inherent in online music services such as provide a rich source of information on musical moods. The abundance of social tags makes this data highly beneficial for developing techniques to manage and retrieve mood information, and enables study of the relationships between music content and mood representations with data substantially larger than that available for conventional emotion research. However, no systematic assessment has been done on the accuracy of social tags and derived semantic models at capturing mood information in music. We propose a novel technique called Affective Circumplex Transformation (ACT) for representing the moods of music tracks in an interpretable and robust fashion based on semantic computing of social tags and research in emotion modeling. We validate the technique by predicting listener ratings of moods in music tracks, and compare the results to prediction with the Vector Space Model (VSM), Singular Value Decomposition (SVD), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). The results show that ACT consistently outperforms the baseline techniques, and its performance is robust against a low number of track-level mood tags. The results give validity and analytical insights for harnessing millions of music tracks and associated mood data available through social tags in application development.

These results make me wonder if the results of tagging represent the average semantic resolution that users want?

Obviously a musician or musicologist would want far finer and sharper distinctions, at least for music of interest to them. Or substitute the domain of your choice. Domain experts want precision, while the average user muddles along with coarser divisions.

We already know from Karen Drabenstott’s work (Subject Headings and the Semantic Web) that library classification systems are too complex for the average user and even most librarians.

On the other hand, we all have some sense of the wasted time and effort caused by the uncharted semantic sea where Google and others practice catch and release with semantic data.

Some of the unanswered questions that remain:

How much semantic detail is enough?

For which domains?

Who will pay for gathering it?

What economic model is best?

Introducing tags to Journal of Cheminformatics

Sunday, March 24th, 2013

Introducing tags to Journal of Cheminformatics by Bailey Fallon.

From the post:

Journal of Cheminformatics will now be “tagging” its publications, allowing articles related by common themes to be linked together.

Where an article has been tagged, readers will be able to access all other articles that share the same tag via a link at the right hand side of the HTML, making it easier to find related content within the journal.

This functionality has been launched for three resources that appear frequently in Journal of Cheminformatics and we will continue to add tags when relevant.

  • Open Babel: Open Babel is an open source chemical toolbox that interconverts over 110 chemical data formats. The first paper describing the features and implementation of Open Babel appeared in Journal of Cheminformatics in 2011, and this tag links it with a number of other papers that use the toolkit
  • PubChem: PubChem is an open archive for the biological activities of small molecules, which provides search and analysis tools to assist users in locating desired information. This tag amalgamates the papers published in the PubChem3D thematic series with other papers reporting applications and developments of PubChem
  • InChI: The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, making it a valuable tool for cheminformaticians, and this tag links a number of papers in Journal of Cheminformatics that rely on its use

It’s not sophisticated authoring of associations but, carefully done, tagging can collate information resources for users.

On export to a topic map application, implied roles could be made explicit, assuming the original tagging was consistent.

Managing Conference Hashtags

Tuesday, November 13th, 2012

David Karger tweets today:

Ironically amusing that ontology researchers can’t manage to agree on a canonical tag for their conference #iswc #iswc12 #iswc2012

If that’s true for ontology researchers, what chance does the rest of the world have?

Just to help ontology researchers along a bit (in LTM syntax):


/* typing topics */

[conf = "conference"]

/* scoping topics */

[SWTwitter01 : conf = "Semantic Web, Twitter hashtag 01."]

[SWTwitter02 : conf = "Semantic Web, Twitter hashtag 02."]

[SWTwitter03 : conf = "Semantic Web, Twitter hashtag 03."]

[iswc2012 : conf = "ISWC 2012, The 11th International Semantic Web Conference"
("#iswc" / SWTwitter01)
("#iswc12" / SWTwitter02)
("#iswc2012" / SWTwitter03)]


I added the “conf” typing topic to the scoping topics to distinguish those tags from others for:

ISWC (International Standard Musical Work Code)

Welcome to ISWC 2013! The International Symposium on Wearable Computers (ISWC)

Wikipedia – ISWC, also lists:

International Speed Windsurfing Class

But missed:

International Student Welcome Committee

There remains the task of distinguishing tags in the wild from tags for these other subjects.

Once that is done, all the tweets about the conference, under these or other tags, can be collocated for a full set of tweets about the conference.

Other subjects and relationships, such as person, date, location, topic, tags, retweets, etc., can be just as easily added.

Personally I would make the default sort order for Tweets a function of date/time, quite possibly mis-using sortname for that purpose. People are accustomed to seeing Tweets in time order and fancy collocation can wait until they select an author, subject, tag, etc.
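Outside of LTM, the collocation step might look like the following sketch. It assumes the hard part is already done, that is, each hashtag variant has been judged to mean the conference and not one of the other ISWCs. The tweet structure and the sample tweets are hypothetical.

```python
import re
from collections import defaultdict

# Hashtag variants already judged to identify the same subject.
HASHTAG_TO_SUBJECT = {
    "#iswc": "iswc2012",
    "#iswc12": "iswc2012",
    "#iswc2012": "iswc2012",
}

def collocate(tweets, hashtag_to_subject):
    """Group tweets by subject, whatever tag variant they used, in time order."""
    by_subject = defaultdict(list)
    for tweet in tweets:
        tags = {tag.lower() for tag in re.findall(r"#\w+", tweet["text"])}
        subjects = {hashtag_to_subject[t] for t in tags if t in hashtag_to_subject}
        for subject in subjects:
            by_subject[subject].append(tweet)
    for subject_tweets in by_subject.values():
        # ISO 8601 timestamps in one time zone sort correctly as plain strings.
        subject_tweets.sort(key=lambda tweet: tweet["time"])
    return dict(by_subject)

# Hypothetical tweets.
tweets = [
    {"time": "2012-11-13T10:05:00", "text": "Slides are up #iswc12"},
    {"time": "2012-11-13T09:00:00", "text": "Keynote starting #ISWC2012 #iswc"},
    {"time": "2012-11-13T12:00:00", "text": "Lunch #food"},
]
for tweet in collocate(tweets, HASHTAG_TO_SUBJECT)["iswc2012"]:
    print(tweet["time"], tweet["text"])
```

A tweet that uses two variants is still listed once, because the variants collapse to a single subject before grouping.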

AgroTagger [Auto-Topic Map Authoring?]

Wednesday, November 7th, 2012


From the webpage:

Used for indexing information resources, Agrotagger is a keyword extractor that uses the AGROVOC thesaurus as its set of allowable keywords. It can extract from Microsoft Office documents, PDF files and web pages.

There are currently several available services that can be accessed either as web interfaces for manual document upload or as REST web services that can be programmatically invoked:

I was following up on AGROVOC thesaurus, FAO thesaurus links with reegle, and found this interesting resource.

Doesn’t seem like a big jump to have a set of keywords that create topics, associations and occurrences with document author(s), journal, place of employment, etc.

Would need proofing but on the other hand could produce a topic map for proofing tout de suite. (No Michel, I had to look it up. 😉 )

Citizen Archivist Dashboard [“…help the next person discover that record”]

Sunday, June 10th, 2012

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

  • Tag – Tagging is a fun and easy way for you to help make National Archives records found more easily online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow “Tag It Tuesdays,” a weekly feature on the NARAtions Blog. [includes “missions” (sets of materials for tagging), rated as “beginner,” “intermediate,” and “advanced.” Or you can create your own mission.]
  • Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

    The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

    [A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

  • Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

    Here are just a few of the ways you may want to participate:

    • Create new pages and edit pre-existing pages
    • Share your research tips
    • Store useful information discovered during research
    • Expand upon a description in our online catalog

    Check out the “Getting Started” page. When you’re ready to edit, you’ll need to log in by creating a username and password.

  • Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

    Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

  • Index the Census – Citizen Archivists, you can help index the 1940 census!

    The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

    You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.

How do you measure the impact of tagging on retrieval?

Thursday, May 31st, 2012

How do you measure the impact of tagging on retrieval? by Tony Russell-Rose.

From the post:

A client of mine wants to measure the difference between manual tagging and auto-classification on unstructured documents, focusing in particular on its impact on retrieval (i.e. relevance ranking). At the moment they are considering two contrasting approaches:

See Tony’s post for details.
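For readers who want something concrete to argue with, one common yardstick is precision at k against a set of relevance judgments. This sketch is my own illustration and not necessarily either of the approaches in Tony’s post; the document ids, rankings and judgments are invented.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

# Hypothetical rankings for one query, from two differently tagged indexes.
manual_ranking = ["d1", "d2", "d3", "d4"]
auto_ranking = ["d5", "d3", "d6", "d1"]
relevant = {"d1", "d3", "d4"}

print("manual:", precision_at_k(manual_ranking, relevant, 3))
print("auto:  ", precision_at_k(auto_ranking, relevant, 3))
```

Averaged over a realistic set of queries, the gap between the two numbers is one estimate of what the tagging method contributes to ranking.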

What do you think?

Closing the Knowledge Gap:.. (Lessons for TMs?)

Friday, December 30th, 2011

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

From the post:

Cisco Systems set out to build a system that takes the search for knowledge beyond documents into the content of social network inside the enterprise. The resulting Cisco Pulse platform was built to deliver corporate employees a better understanding of who’s communicating with whom, how, and about what. Working with Lucid Imagination, Cisco turned to open source — specifically, Solr/Lucene technology — as the foundation of the search architecture.

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

You need to read the entire article but two things to note:

  • No uniform vocabulary: Every “organization” created its own.
  • Automatic tagging: Content was automatically tagged (read users did not tag)

The article doesn’t go into any real depth about the tagging but it is implied that who created the content and other information is getting “tagged” as well.

I read that to mean, in a topic maps context, that with a declared vocabulary and automatic tagging, another process could create associations with roles and role players, and other topic map constructs, without bothering end users with those tasks.
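A rough sketch of that reading, with no claim that Cisco Pulse works this way. The vocabularies, the message format, and the association type and role names are all hypothetical, and only single-word vocabulary terms are handled.

```python
import re

def auto_tag(message, vocabulary):
    """Tag a message against one organization's vocabulary and derive associations."""
    words = set(re.findall(r"[a-z0-9]+", message["body"].lower()))
    tags = sorted(term for term in vocabulary if term.lower() in words)
    # The author never tags anything; the associations fall out of the match.
    associations = [
        {"type": "communicated-about",
         "roles": {"author": message["author"], "subject": tag}}
        for tag in tags
    ]
    return tags, associations

# Two organizations, each with its own vocabulary (no uniform vocabulary).
search_team_vocabulary = {"solr", "lucene"}
operations_vocabulary = {"upgrade", "outage"}

message = {"author": "alice",
           "body": "Solr indexing is slower after the Lucene upgrade."}

print(auto_tag(message, search_team_vocabulary)[0])
print(auto_tag(message, operations_vocabulary)[0])
```

The same message yields different tags under different vocabularies, which is where declaring equivalents between tags would come in.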

Not to mention that declaring equivalents between tags as part of the reading/discovery process might be limited to some but not all users.

An incremental or perhaps even evolving authoring of a topic map.

Rather than a dead-tree resource, delivered as a fait accompli, a topic map can change as new information or new views of existing/new information are added to the map. (A topic map doesn’t have to be so useful. It can be the equivalent of a dead-tree resource if you really want.)

Automatically creating tags for big blogs with WordPress (possible upgrade)

Wednesday, December 28th, 2011

Automatically creating tags for big blogs with WordPress (possible upgrade)

Ajay Ohri writes:

I use the simple-tags plugin in WordPress for automatically creating and posting tags. I am hoping this makes the site better to navigate. Given the fact that I had not been a very efficient tagger before, this plugin can really be useful for someone in creating tags for more than 100 (or 1000 posts) especially WordPress based blog aggregators. (added the hyperlink to simple-tags)

I am thinking about possible changes to this blog to make it more useful. Both for me and you.

Curious if anyone has experience with the “simple-tags” plugin? Was it useful?

Do you think it would be useful with the type of material you find here?

Evolutionary Subject Tagging in the Humanities…

Saturday, December 3rd, 2011

Evolutionary Subject Tagging in the Humanities; Supporting Discovery and Examination in Digital Cultural Landscapes by Jack Ammerman, Vika Zafrin, Dan Benedetti, Garth W. Green.


In this paper, the authors attempt to identify problematic issues for subject tagging in the humanities, particularly those associated with information objects in digital formats. In the third major section, the authors identify a number of assumptions that lie behind the current practice of subject classification that we think should be challenged. We move then to propose features of classification systems that could increase their effectiveness. These emerged as recurrent themes in many of the conversations with scholars, consultants, and colleagues. Finally, we suggest next steps that we believe will help scholars and librarians develop better subject classification systems to support research in the humanities.

Truly remarkable piece of work!

Just to entice you into reading the entire paper, the authors challenge the assumption that knowledge is analogue. Successfully in my view but I already held that position so I was an easy sell.

BTW, if you are in my topic maps class, this paper is required reading. Summarize what you think are the strong/weak points of the paper in 2 to 3 pages.

Modular Unified Tagging Ontology (MUTO)

Thursday, November 17th, 2011

Modular Unified Tagging Ontology (MUTO)

From the webpage:

The Modular Unified Tagging Ontology (MUTO) is an ontology for tagging and folksonomies. It is based on a thorough review of earlier tagging ontologies and unifies core concepts in one consistent schema. It supports different forms of tagging, such as common, semantic, group, private, and automatic tagging, and is easily extensible.

I thought the tagging axioms were worth repeating:

  • A tag has always exactly one label – otherwise it is not a tag.

    (Additional labels can be separately defined, e.g. via skos:Concept.)
  • Tags with the same label are not necessarily semantically identical.

    (Each tag has its own identity and property values.)
  • A tag can itself be a resource of tagging (tagging of tags).

From the properties defined, however, it isn’t clear how to determine when tags do have the same meaning and/or how to communicate that understanding to others?

Ah, or would that be a tagging of a tagging?

That sounds like it leaves a lot of semantic detail on the cutting room floor but it may be that viable semantic systems, oh, say natural languages, do exactly that. Something to think about isn’t it?
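The three axioms quoted above can be sketched in plain Python. This is an illustration of the axioms only, not the MUTO vocabulary itself, and the class and field names are my own.

```python
import itertools
from dataclasses import dataclass, field

_next_id = itertools.count(1)

@dataclass(eq=False)  # eq=False: tags compare by identity, never by label
class Tag:
    label: str  # axiom 1: exactly one label per tag
    # axiom 3: a tag can itself be tagged
    tags: list = field(default_factory=list)
    tag_id: int = field(default_factory=lambda: next(_next_id))

first = Tag("java")
second = Tag("java")

# axiom 2: same label, but not the same tag
print(first.label == second.label, first == second)

# A tagging of a tag: one way to say which "java" was meant.
first.tags.append(Tag("programming language"))
second.tags.append(Tag("island"))
```

Nothing in the sketch says when two tags do mean the same thing; that judgment has to be recorded somewhere else, which is the gap noted above.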


Saturday, July 2nd, 2011


From the webpage:

uClassify is a free web service where you can easily create your own text classifiers. You can also directly use classifiers that have already been shared by the community.


  • Language detection
  • Web page categorization
  • Written text gender and age recognition
  • Mood
  • Spam filter
  • Sentiment
  • Automatic e-mail support
  • See below for some examples

So what do you want to classify on? Only your imagination is the limit!

As of 1 July 2011, thirty-seven public classifiers are waiting on you and your imagination.

The emphasis is on tagging documents.

How useful is tagging documents when a search results in > 100 documents? Would your answer be the same or different if the search results were < 20 documents? What if the search results were > 500 documents?

I first saw this at textifter blog in the post A Classifier for the Masses.

Modeling Social Annotation: A Bayesian Approach

Monday, January 3rd, 2011

Modeling Social Annotation: A Bayesian Approach Authors: Anon Plangprasopchok, Kristina Lerman


Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, for example, Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users can potentially be used to infer categorical knowledge, classify documents, or recommend new relevant information. Traditional text inference methods do not make the best use of social annotation, since they do not take into account variations in individual users’ perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on the synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.


  1. How does (if it does) a tagging vocabulary differ from a regular vocabulary? (3-5 pages, no citations)
  2. Would this technique be applicable to tracing vocabulary usage across cited papers? In other words, following an author backwards through materials they cite? (3-5 pages, no citations)
  3. What other characteristics do you think a paper would have where the usage of a term had shifted to a different meaning? (3-5 pages, no citations)

Survey on Social Tagging Techniques

Monday, December 6th, 2010

Survey on Social Tagging Techniques

Authors: Manish Gupta, Rui Li, Zhijun Yin, Jiawei Han

Keywords: Social tagging, bookmarking, tagging, social indexing, social classification, collaborative tagging, folksonomy, folk classification, ethnoclassification, distributed classification, folk taxonomy


Social tagging on online portals has become a trend now. It has emerged as one of the best ways of associating metadata with web objects. With the increase in the kinds of web objects becoming available, collaborative tagging of such objects is also developing along new dimensions. This popularity has led to a vast literature on social tagging. In this survey paper, we would like to summarize different techniques employed to study various aspects of tagging. Broadly, we would discuss about properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags and problems associated with tagging usage. We would discuss topics like why people tag, what influences the choice of tags, how to model the tagging process, kinds of tags, different power laws observed in tagging domain, how tags are created, how to choose the right tags for recommendation, etc. We conclude with thoughts on future work in the area.

I recommend this survey partly for its depth but also because it does not lack a viewpoint:

…But fixed static taxonomies are rigid, conservative, and centralized. [cite omitted]…Hierarchical classifications are influenced by the cataloguer’s view of the world and, as a consequence, are affected by subjectivity and cultural bias. Rigid hierarchical classification systems cannot easily keep up with an increasing and evolving corpus of items…By their very nature, hierarchies tend to establish only one consistent, authoritative structured vision. This implies a loss of precision, erases differences of expression, and does not take into account the variety of user needs and views.

I am not innocent of having made similar arguments in other contexts. It makes good press among the young and dissatisfied, but it doesn’t bear up to close scrutiny.

For example, the claim is made that “hierarchical classifications” are “affected by subjectivity and cultural bias.” The implied claim is that social tagging is not. Yes? I would argue that all classification, hierarchical and otherwise, is affected by “subjectivity and cultural bias.”


  1. Choose one of the other claims about hierarchical classifications. Is it also true of social tagging? Why/Why not? (3-5 pages, no citations)
  2. Choose a social tagging practice. What are its strengths/weaknesses? (3-5 pages, no citations)
  3. How would you use topic maps with the social tagging practice in #2? (3-5 pages, no citations)

Using Tag Clouds to Promote Community Awareness in Research Environments

Friday, October 15th, 2010

Using Tag Clouds to Promote Community Awareness in Research Environments

Authors: Alexandre Spindler, Stefania Leone, Matthias Geel, Moira C. Norrie

Keywords: Tag Clouds – Ambient Information – Community Awareness


Tag clouds have become a popular visualisation scheme for presenting an overview of the content of document collections. We describe how we have adapted tag clouds to provide visual summaries of researchers’ activities and use these to promote awareness within a research group. Each user is associated with a tag cloud that is generated automatically based on the documents that they read and write and is integrated into an ambient information system that we have implemented.

One of the selling points of topic maps has been the serendipitous discovery of new information. Discovery is predicated on awareness and this is an interesting approach to that problem.
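The abstract does not say how the authors weight tags, so the following is only a rough sketch of one common way to build a per-user tag cloud: count terms across the documents a user reads and writes, then scale the counts to font sizes. The function names, the stopword list, and the point-size range are all assumptions made for illustration, not details from the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "from", "that", "this"}

def tag_weights(documents, top_n=10):
    """Count candidate tags (lowercased words) across a user's documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        # Skip stopwords and very short words, which rarely make useful tags.
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)

def font_sizes(weights, min_pt=10, max_pt=32):
    """Map tag counts linearly onto a font-size range (assumed display rule)."""
    if not weights:
        return {}
    hi = max(count for _, count in weights)
    lo = min(count for _, count in weights)
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {tag: min_pt + (count - lo) * (max_pt - min_pt) // span
            for tag, count in weights}

if __name__ == "__main__":
    docs = [
        "Topic maps and tagging",
        "Tagging research on topic maps",
        "Tagging communities",
    ]
    for tag, size in font_sizes(tag_weights(docs)).items():
        print(f"{tag}: {size}pt")
```

Raw term frequency is the simplest possible weighting. It says nothing about whether a colleague would recognize the resulting tags, which is the awareness question the post raises.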


  1. To what extent does awareness of tagging by colleagues influence future tagging?
  2. How would you design a project to measure the influence of tagging?
  3. Would the influence of tagging change your design of an information interface? Why/Why not? If so, how?

tagging, communities, vocabulary, evolution

Tuesday, October 5th, 2010

tagging, communities, vocabulary, evolution

Authors: Shilad Sen, Shyong K. (Tony) Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski, Jeremy Osterhouse, F. Maxwell Harper, John Riedl

Keywords: communities, evolution, social book-marking, tagging, vocabulary


A tagging community’s vocabulary of tags forms the basis for social navigation and shared expression. We present a user-centric model of vocabulary evolution in tagging communities based on community influence and personal tendency. We evaluate our model in an emergent tagging system by introducing tagging features into the MovieLens recommender system. We explore four tag selection algorithms for displaying tags applied by other community members. We analyze the algorithms’ effect on vocabulary evolution, tag utility, tag adoption, and user satisfaction.

The influence of an interface on the creation of topic maps is an open area for research. Research on tagging behavior is an excellent starting point for such studies.

Question: Would you modify the experimental setup to test the creation of topics? If so, in what way? Why?