Archive for September, 2010

Assessing the scenic route: measuring the value of search trails in web logs

Thursday, September 30th, 2010

Assessing the scenic route: measuring the value of search trails in web logs Authors: Ryen W. White, Jeff Huang Keywords: log analysis, search trails, trail following


Search trails mined from browser or toolbar logs comprise queries and the post-query pages that users visit. Implicit endorsements from many trails can be useful for search result ranking, where the presence of a page on a trail increases its query relevance. Following a search trail requires user effort, yet little is known about the benefit that users obtain from this activity versus, say, sticking with the clicked search result or jumping directly to the destination page at the end of the trail. In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, *including trail recommendation systems that display trails on search result pages.* (emphasis added)

If your topic map client has search logs for internal resources, don’t neglect those as part of your topic map construction process: they can help identify important subjects and navigation links between subjects.

This won the best paper award at SIGIR 2010.

Plagiarism and Subject Identity

Thursday, September 30th, 2010

Plagiarism detection is a form of detecting subject-sameness.

If you think of a document as a subject and say 95% of it is the same as another document, you could conclude that it is the same subject. (Or set your own level of duplication for subject-sameness.)
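A minimal sketch of that idea, using word shingles as the unit of comparison. The 0.95 threshold and the shingle length are illustrative choices, not something prescribed here:

```python
def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def same_subject(doc_a, doc_b, threshold=0.95, k=5):
    """Treat two documents as the same subject when their shingle
    overlap (Jaccard similarity) meets the chosen threshold."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    return jaccard >= threshold
```

Lowering `threshold` is exactly the "set your own level of duplication" move above.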

One of the early use cases for topic maps was avoiding the duplication of documentation (and billing for the same) for defense systems.

Detecting self-plagiarism from a law firm, vendor, contractor, or consultant is one thing.

Putting those incidents together across a government agency, business, institution, or enterprise is a job for topic maps.

Entity Resolution – Journal of Data and Information Quality

Thursday, September 30th, 2010

Special Issue on Entity Resolution.

The Journal of Data and Information Quality is a new journal from the ACM.

Calls for papers should not require ACM accounts for viewing.

I have re-ordered (to put the important stuff first) and reproduced the call below:

Important Dates

  • Submissions due: December 15, 2010
  • Acceptance Notification: April 30, 2011
  • Final Paper Due: June 30, 2011
  • Target Date for Special Issue: September 2011

Topics of interest include, but are not limited to:

  • ER impacts on Information Quality and impacts of Information Quality on ER
  • ER frameworks and architectures
  • ER outcome/performance assessment and metrics
  • ER in special application domains and contexts
  • ER and high-performance computing (HPC)
  • ER education
  • ER case studies
  • Theoretical frameworks for ER and entity-based integration
  • Methods and techniques for
    • Entity reference extraction
    • Entity reference resolution
    • Entity identity management and identity resolution
    • Entity relationship analysis

Entity resolution (ER) is a key process for improving data quality in data integration in modern information systems. ER covers a wide range of approaches to entity-based integration, known variously as merge/purge, record de-duplication, heterogeneous join, identity resolution, and customer recognition. More broadly, ER also includes a number of important pre- and post-integration activities, such as entity reference extraction and entity relationship analysis.

Based on direct record matching strategies, such as those described by the Fellegi-Sunter Model, new theoretical frameworks are evolving to describe ER processes and outcomes that include other types of inferred and asserted reference linking techniques.

Businesses have long recognized that the quality of their ER processes directly impacts the overall value of their information assets and the quality of the information products they produce. Government agencies and departments, including law enforcement and the intelligence community, are increasing their use of ER as a tool for accomplishing their missions as well.

Recognizing the growing interest in ER theory and practice, and its impact on information quality in organizations, the ACM Journal of Data and Information Quality (JDIQ) will devote a special issue to innovative and high-quality research papers in this area. Papers that address any aspect of entity resolution are welcome.

Natural Language Toolkit

Wednesday, September 29th, 2010

Natural Language Toolkit is a set of Python modules for natural language processing and text analytics. Brought to my attention by Kirk Lowery.

Two near term tasks come to mind:

  • Feature comparison to LingPipe
  • Finding linguistic software useful for topic maps

Suggestions of other toolkits welcome!

LingPipe
Wednesday, September 29th, 2010


The tutorial listing for LingPipe is the best summary of its capabilities.

Its sandbox is another “must see” location.

There may be better introductions to linguistic processing but I haven’t seen them.

International Workshop on Similarity Search and Applications (SISAP)

Tuesday, September 28th, 2010

International Workshop on Similarity Search and Applications (SISAP)


The International Workshop on Similarity Search and Applications (SISAP) is a conference devoted to similarity searching, with emphasis on metric space searching. It aims to fill in the gap left by the various scientific venues devoted to similarity searching in spaces with coordinates, by providing a common forum for theoreticians and practitioners around the problem of similarity searching in general spaces (metric and non-metric) or using distance-based (as opposed to coordinate-based) techniques in general.

SISAP aims to become an ideal forum to exchange real-world, challenging and exciting examples of applications, new indexing techniques, common testbeds and benchmarks, source code, and up-to-date literature through a Web page serving the similarity searching community. Authors are expected to use the testbeds and code from the SISAP Web site for comparing new applications, databases, indexes and algorithms.
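To see what coordinate-free, distance-based searching looks like, here is a minimal sketch of pivot-based pruning with the triangle inequality. The metric (edit distance) and the single-pivot design are illustrative choices; real indexes use many pivots or tree structures:

```python
def edit_distance(a, b):
    """Levenshtein distance: a metric on strings, no coordinates required."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def build_index(collection, dist, pivot):
    """Precompute each object's distance to a single pivot."""
    return [(obj, dist(obj, pivot)) for obj in collection]

def range_search(query, radius, index, dist, pivot):
    """Objects within `radius` of `query`, pruning with the triangle
    inequality: |d(q,p) - d(o,p)| <= d(q,o)."""
    dqp = dist(query, pivot)
    hits = []
    for obj, dop in index:
        if abs(dqp - dop) > radius:
            continue                 # pruned without computing d(q, obj)
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits
```

With pivot "dog" over ["cat", "cot", "coat", "dog"], a query for strings within edit distance 1 of "cat" never has to compute the query's distance to "dog": the precomputed pivot distance rules it out.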

Proceedings from prior years, source code, sample data, a real gem of a site.

Mining Billion-node Graphs: Patterns, Generators and Tools

Tuesday, September 28th, 2010

Mining Billion-node Graphs: Patterns, Generators and Tools Author: Christos Faloutsos (CMU)

Presentation on the Pegasus (PEta GrAph mining System) project.

If you have large amounts of real world data and need some motivation, take a look at this presentation.

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library

Tuesday, September 28th, 2010

Similarity and Duplicate Detection System for an OAI Compliant Federated Digital Library Authors: Haseebulla M. Khan, Kurt Maly and Mohammad Zubair Keywords: OAI – duplicate detection – digital library – federation service


The Open Archives Initiative (OAI) is making it feasible to build high level services such as a federated search service that harvests metadata from different data providers using the OAI protocol for metadata harvesting (OAI-PMH) and provides a unified search interface. There are numerous challenges to build and maintain a federation service, and one of them is managing duplicates. Detecting exact duplicates, where two records have identical sets of metadata fields, is straightforward. The problem arises when two or more records differ slightly due to data entry errors, for example. Many duplicate detection algorithms exist, but they are computationally intensive for a large federated digital library. In this paper, we propose an efficient duplicate detection algorithm for a large federated digital library like Arc.

The authors discovered that title weight was more important than author weight in the discovery of duplicates, working with a subset of 73 archives comprising 465,440 records. It would be interesting to apply this insight to a resource like WorldCat, where duplicates are a noticeable problem.
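A sketch of weighted-field duplicate scoring along those lines, with title weighted above author. The weights and threshold here are illustrative, not the values from the paper:

```python
from difflib import SequenceMatcher

WEIGHTS = {"title": 0.7, "author": 0.3}   # title matters more than author

def field_sim(a, b):
    """Normalized string similarity in [0, 1] for one metadata field."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_score(rec_a, rec_b):
    return sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def is_duplicate(rec_a, rec_b, threshold=0.85):
    return duplicate_score(rec_a, rec_b) >= threshold
```

For a large federation, one would pair this with blocking (only comparing records that share, say, a title prefix) to avoid the quadratic comparison cost the abstract worries about.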

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information

Monday, September 27th, 2010

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information Authors: Kristine A. Pattin, Jiang Gui, Jason H. Moore Keywords: GWAS – SNPs – Protein-protein interaction – Epistasis


Genome wide association studies (GWAS) are now allowing researchers to probe the depths of common complex human diseases, yet few have identified single sequence variants that confer disease susceptibility. As hypothesized, this is due to the fact that multiple interacting factors influence clinical endpoint. Given that the number of single nucleotide polymorphism (SNP) combinations grows exponentially with the number of SNPs being analyzed, computational methods designed to detect these interactions in smaller datasets are thus not applicable. Providing statistical expert knowledge has exhibited an improvement in their performance, and we believe biological expert knowledge to be as capable. Since one of the strongest demonstrations of the functional relationship between genes is protein-protein interactions, we present a method that exploits this information in genetic analyses. This study provides a step towards utilizing expert knowledge derived from public biological sources to assist computational intelligence algorithms in the search for epistasis.
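To put numbers on the combinatorial blow-up the abstract mentions, assuming a GWAS scale of 500,000 SNPs (the figure is illustrative):

```python
import math

n = 500_000
pairs = math.comb(n, 2)    # two-way SNP interactions: ~1.25e11
triples = math.comb(n, 3)  # three-way: ~2.1e16, far beyond brute force
```

Even pairwise testing is around 10^11 comparisons, which is why the authors look to expert knowledge to prune the search.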

Applying human knowledge “…to assist computational intelligence algorithms…,” sounds like subject identity and topic maps to me!

DeepaMehta
Monday, September 27th, 2010


From the homepage:

DeepaMehta is a software platform for Knowledge Management. Knowledge is represented in a semantic network and is handled collaboratively. DeepaMehta combines interdisciplinary research with the idea of Open Source to generate a true benefit for workflow as well as for social processes. At the same time, Deepa Mehta is an Indian movie director.

The DeepaMehta user interface is built according to research in Cognitive Psychology and accommodates the knowledge building process of the individual. Instead of handling information through applications, windows and files, with DeepaMehta the user handles all kinds of information directly and individually. DeepaMehta's user interface is completely based on Mind Maps / Concept Maps.

Not quite my choice for an interface but then I have spent too many decades with books and similar resources.

A topic map that presents like a printed page but is populated with nodes of associations offering further information would be more to my tastes.

PS: Posted to me by Jack Park

Soft fuzzy rough sets for robust feature evaluation and selection

Monday, September 27th, 2010

Soft fuzzy rough sets for robust feature evaluation and selection Authors: Qinghua Hu, Shuang An and Daren Yu. Keywords: Fuzzy rough sets – Feature evaluation – Noise – Soft fuzzy rough sets – Classification learning – Feature reduction

Introduces techniques that reduce the influence of noise on fuzzy rough sets. Important in a world full of noise.
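For orientation, the classical fuzzy-rough lower approximation (the construction the paper's "soft" variants make robust to noise) can be sketched as follows; the similarity relation and fuzzy set below are invented examples:

```python
# Classical fuzzy-rough lower approximation:
#   lower(A)(x) = min over y of max(1 - R(x, y), A(y))
# where R is a fuzzy similarity relation and A a fuzzy set. A single noisy
# sample can drag this min down, which is the weakness "soft" variants target.

def lower_approx(x, universe, R, A):
    return min(max(1.0 - R(x, y), A(y)) for y in universe)

universe = [0, 1, 2]
R = lambda x, y: 1.0 if x == y else 0.5     # toy similarity relation
A = {0: 1.0, 1: 0.9, 2: 0.2}.get            # toy fuzzy membership function
```

Note how the single low-membership sample (element 2) pulls the lower approximation of element 0 down to 0.5 despite only moderate similarity: that sensitivity is exactly the noise problem.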

Question for the ambitious: Survey ten articles on feature reduction that don’t cite each other. Pick 2 features that were eliminated in each article. Do you agree/disagree with the evaluation of those features? Not a question of the numerical calculation but your view of the useful/not useful nature of the feature.

Do ask, do tell

Sunday, September 26th, 2010

Do ask, do tell: a policy for successful semantic integration.

That is, ask and allow others to tell how they identify their subjects.

It does not mean, ask and then tell others a solution, approach, etc. to identify their subjects. (Including FOL.)

Users should be enabled to know when they are talking about the same thing. Using their own vocabularies.

Teach a user to integrate information and they have learned a new skill.

Teach a user to call an expert and they have gained a new bill.

Semantic experts have enough to do without making ordinary vocabularies require expert maintenance.

The General Case

Sunday, September 26th, 2010

The SciDB project illustrates that there is no general case solution for semantic identity.

If we distinguish between IRIs as addresses versus IRIs as identifiers, IRIs are useful for some cases of semantic identity. (IRIs can be used even if you don’t make that distinction, but they are less useful.)

But can you imagine an IRI for each tuple of values in some 15 petabytes of data annually from the Large Hadron Collider? It may be very important to identify any number of those tuples. Such as if (not when) they discover the Higgs boson.

Those tuples have semantic identity, as do subjects composed of those tuples.

Rather than seeking general solutions for all semantic identity, perhaps we should find solutions that work for particular cases.

SciDB – Numeric Array Database (NAD)

Saturday, September 25th, 2010

SciDB announced its first source-code release in an Open Letter to the SciDB Community on 24 September 2010.

In Overview of SciDB, Large Scale Array Storage, Processing and Analysis, the SciDB team says scientific data differs from business data because:

  1. scientific analysis typically requires mathematically and algorithmically sophisticated data processing methods
  2. data generated by modern scientific instruments is extremely large

I don’t find those convincing.

The article also claims: “…scientific data has a necessary and implicit ordering; for each element or data value there are other values left, right, up, down, next, previous, or adjacent to it.”

The content of such arrays is always numeric data, so one can speak of numeric array databases.
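The "implicit ordering" claim is easy to see in code: in an array model, a cell's neighbors fall directly out of its coordinates, with no join or index lookup needed. The grid below is illustrative:

```python
grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

def neighbors(i, j):
    """Values left, right, up, and down of cell (i, j), where they exist."""
    deltas = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    return [grid[i + di][j + dj]
            for di, dj in deltas
            if 0 <= i + di < len(grid) and 0 <= j + dj < len(grid[0])]
```

A relational store would have to reconstruct this adjacency with self-joins on coordinate columns; the array model gets it for free.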

I find the overall approach refreshing because it isn’t aiming for a general solution to all data issues.

Instead, a solution for numeric data in an array.

Now if we can just get past the search for a general semantic solution.

Pastebin for Topic Maps

Friday, September 24th, 2010

Pastebin for topic maps.

From Lars Heuer, a pastebin with syntax highlighting for topic map syntaxes.

A proposal for JTM 1.1

Friday, September 24th, 2010

A proposal for JTM 1.1.

Disdaining to create another topic map syntax, Jan Scheiber has proposed JTM 1.1. (Good thing it wasn’t another topic map syntax. 😉 )

Seriously, it is a proposal designed to make using topic maps on mobile devices a more viable option.

REST in the Web3 Platform

Friday, September 24th, 2010

REST in the Web3 Platform.

Graham Moore details the choices made to make the Web3 platform follow “many” of the principles of REST.

While you are there, watch Web3 Platform Getting Started No 1. Good presentation.

Additional resources:

Tutorial on REST: Learn REST: A Tutorial

Fielding, Roy T.; Taylor, Richard N., Principled Design of the Modern Web Architecture

Fielding, Roy Thomas (dissertation), Architectural Styles and the Design of Network-based Software Architectures


See also: Restful Interface to Topic Maps, another REST interface effort.

HUGO Gene Nomenclature Committee

Thursday, September 23rd, 2010

HUGO Gene Nomenclature Committee, a committee assigning unique names to genes.

Become familiar with the HUGO site, then read: The success (or not) of HUGO nomenclature (Genome Biology, 2006).

Now read: Moara: a Java library for extracting and normalizing gene and protein mentions (BMC Bioinformatics 2010)

Q: How would you apply the techniques in the Moara article to build a topic map? Would you keep/discard normalization?

PS: Moara Project (software, etc.)

The Linking Open Data cloud diagram

Thursday, September 23rd, 2010

The Linking Open Data cloud diagram is maintained by Richard Cyganiak and Anja Jentzsch.

I suppose having DBpedia at the center of linked data is better than the CIA Factbook. 😉

I find large visualizations like this one useful as marketing tools or “that’s cool” examples, but not terribly useful for actual analysis.

Has your experience been different?

KP-Lab Knowledge Practices Lab

Thursday, September 23rd, 2010

KP-Lab Knowledge Practices Lab.

The KP-Lab project designs and implements a modular, flexible, and extensible ICT system that supports pedagogical methods to foster knowledge creation in educational and workplace settings. The system provides tools for collaborative work around shared objects, and for knowledge practices in the various settings addressed by the project.

It offers the following tools:

  • Knowledge Practices Environment (KPE)
  • The Visual Modeling (Language) Editor
  • Activity System Design Tools (ASDT)
  • Semantic Multimedia Annotation tool (SMAT)
  • Map-It and M2T (meeting practices)
  • The CASS-Query tool
  • The CASS-Memo tool
  • Awareness Services
  • RDF Suite
  • KMS-Persistence API
  • Text Mining Services

Pick any one of these tools and name five (5) things you like about it and five (5) things you dislike about it. How would you change the things you dislike? (General prose description is sufficient.)

Consultative Committee for Space Data Systems (CCSDS)

Wednesday, September 22nd, 2010

Consultative Committee for Space Data Systems (CCSDS) is a collaborative effort to create standards for space data.

Interesting because:

  1. Space exploration gets funding from governments
  2. Subjects for mapping in a variety of formats, etc.

Assuming that agreement can be reached on the format for a mission, the question remains: how do we integrate that data with articles, books, presentations, data from other missions or sources, and/or analysis of other data?

That agreement is reached on a format for one mission, or even one set of data, is just a starting point for a more complicated conversation.

Journal of Cheminformatics

Wednesday, September 22nd, 2010

Journal of Cheminformatics.

Journal of Cheminformatics is an open access, peer-reviewed, online journal encompassing all aspects of cheminformatics and molecular modeling including:

  • chemical information systems, software and databases, and molecular modelling
  • chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases
  • computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques

A good starting place for chemical subject identity issues.

S-Match
Wednesday, September 22nd, 2010

S-Match. Semantic “matching” software.


S-Match takes any two tree like structures (such as database schemas, classifications, lightweight ontologies) and returns a set of correspondences between those tree nodes which semantically correspond to one another.

It’s late so I have only installed it on an Ubuntu system and run the demo files. Not impressed so far. Run the demo and you will see what I mean.

I will try again over the weekend; in the meantime, if you try it, comments are welcome.

Encryption Using Topic Maps

Tuesday, September 21st, 2010

Topic maps are well suited to message passing in a loose confederation of actors, such as hackers.

Any loose confederation of actors could openly distribute information meaningful only to a small group.

Merging would be the key to assembling the correct message. (Imagine the “measurements” of models being merged to form geographic coordinates.)

Messages could be hidden in a flood of other messages, only a tiny fraction of which merge.
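A toy sketch of the idea: fragments ride on topics, topics merge when they share a subject identifier, and only the merged chain yields the message. All identifiers and fragments below are invented:

```python
def merge_topics(topics):
    """Merge topics that share any subject identifier, collecting their
    message fragments; topics with disjoint identifiers stay apart."""
    merged = []
    for ids, frag in topics:
        for group in merged:
            if group["ids"] & ids:
                group["ids"] |= ids
                group["frags"].append(frag)
                break
        else:
            merged.append({"ids": set(ids), "frags": [frag]})
    return merged

topics = [
    ({"urn:x:k1"}, (0, "MEET")),            # (sequence, fragment)
    ({"urn:x:k1", "urn:x:k2"}, (1, "AT")),
    ({"urn:x:k2"}, (2, "NOON")),
    ({"urn:x:decoy"}, (0, "NOISE")),        # decoy: never merges in
]

groups = merge_topics(topics)
chain = max(groups, key=lambda g: len(g["frags"]))
message = " ".join(f for _, f in sorted(chain["frags"]))
```

Without the identifiers that trigger merging, a reader sees only disconnected topics; the decoy contributes nothing to the assembled chain.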

Suggestions on a “secret” phrase to encode using merging and topic maps? (Must be non-libelous. Just in case it is ever decrypted.)


  1. Would this be more or less secure than a set of XQuery statements against an unknown (to others) public text, the results of which are ordered to display the message? Why?
  2. Would you transmit the merging rules or have them known in advance? Why?
  3. How would you transmit data and/or merging rules?
  4. Would you write merging rules against public data sets? Why?

Elements of Computer Security

Tuesday, September 21st, 2010

Elements of Computer Security Author: David Salomon Keywords: computer security, viruses, worms, security, network security, identity.

Interesting introduction to viruses, worms, etc. Enough technical detail to keep it interesting.

Possible ways to use topic maps in computer security:

  1. Treat the code in viruses, worms, etc. as subjects and use a topic map to document/discover commonalities.
  2. Identify weak code in applications/OSes as subjects and find occurrences in other parts of an application/OS.
  3. Create maps between literature and applications/OSes.
  4. Create maps of viruses, worms, etc. to specific applications/OSes.
  5. Create maps of sites known to have particular weaknesses.

EBLIDA: European Bureau of Library, Information and Documentation Associations

Tuesday, September 21st, 2010

EBLIDA: European Bureau of Library, Information and Documentation Associations.

Take note of funding opportunities and the Vienna Declaration on support for libraries.

If you think of topic maps as extending and making more specific the organization of information about subjects held by a library, the value-add of patron based topic maps becomes obvious. With a modicum of direction, patrons could make the value of a library manifest throughout their community. Done carefully, that value could be shared with other communities as well.

ICEP – Indiana Cheminformatics Education Portal

Tuesday, September 21st, 2010

ICEP – Indiana Cheminformatics Education Portal.

ICEP is a repository of freely accessible cheminformatics educational materials generated by the Indiana University Cheminformatics Program.

Learn cheminformatics here, or treat it as a great starting place on cheminformatics subject identity issues.

Cost/Benefit of Semantics

Monday, September 20th, 2010

The cost/benefit ratio of imposing semantics on data is an open area for research.

The cost of creating an index for a technical book is something O’Reilly, for example, can estimate quite closely.

What I haven’t found is a way to estimate the benefit of having such an index.

I deeply appreciate a good index, but appreciation isn’t the hard data that goes into a cost/benefit calculation.

Imposing semantics at a journal article level and imposing semantics on the contents of articles are two very different costs.

What measure should be used to justify either one?

Astroinformatics 2010

Monday, September 20th, 2010

Astroinformatics 2010.

Conference on astronomical data, its processing and semantics.

The astronomical community has made data interchangeable, in terabyte and soon to be petabyte quantities.


Have they solved the problem of interchangeable semantics?

Or have they reduced semantics to the point that interchange becomes easier/possible?

Do semantic interchange problems/issues/opportunities reappear when more semantics are imposed?

What about preservation of semantics?

Similarity Indexing: Algorithms and Performance (1996)

Monday, September 20th, 2010

Similarity Indexing: Algorithms and Performance (1996) Authors: David A. White, Ramesh Jain Keywords: Similarity Indexing, High Dimensional Feature Vectors, Approximate k-Nearest Neighbor Searching, Closest Points, Content-Based Retrieval, Image and Video Database Retrieval.

The authors of this paper coined the phrase “similarity indexing.”

A bit dated now but interesting as a background to techniques currently in use.
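For context, the baseline every similarity index aims to beat is the brute-force linear scan. A minimal k-nearest-neighbor version over feature vectors (the vectors and the metric are illustrative):

```python
import math
import heapq

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k=3):
    """Return the k vectors closest to query, nearest first (O(n) scan)."""
    return heapq.nsmallest(k, vectors, key=lambda v: euclidean(query, v))
```

Index structures trade preprocessing for sublinear query time; in high-dimensional feature spaces that trade gets hard, which is why approximate k-NN (one of the paper's keywords) became the practical compromise.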

A topic map tracing the development of one of the “similarity” techniques would make an excellent thesis project.