Archive for the ‘Uncategorized’ Category

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

Monday, December 12th, 2011

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

From the post:

Bioinformatics software provider IO Informatics recently released its free Knowledge Explorer Personal Edition. Version 3.6 of the Personal Edition can handle most of what Knowledge Explorer Professional 3.6, launched in October, can, but it does all its work in memory without direct connectivity to a back-end database.

“In particular, a lot of the strengths of Knowledge Explorer have to do with modeling data as RDF and then testing queries, visualizing and browsing the data to see that you have the ontologies and data mappings you need for your integration and application requirements,” says Robert Stanley, IO Informatics president and CEO. The Personal version is aimed at academic experts focused on data integration and semantic data modeling, as well as personal power users in life sciences and other data-intensive industries, or anyone who wants to learn the tool in anticipation of leveraging their enterprise data sets for collaboration and integration projects.

The latest Knowledge Explorer 3.6 feature set extends the thesaurus application in the product, so that users can bring in additional thesauri and vocabularies, as well as the user interaction options for importing, merging and modifying ontologies. For the Pro edition, IO Informatics has also been working with database vendors to increase query speed and loading.

I am not sure what we did collectively to merit presents so early in the holiday season but I won’t spend a lot of time worrying about it.

Particularly interested in the “…additional thesauri and vocabularies…” aspect of the software. In part because it isn’t that big a step to add in a topic map, which could help provide context and other factors to better enable integration of information.

Oh, and from further down on the webpage:

Stanley sees a number of potential applications for those who might like to try the Personal version for integrating and modeling smaller data sets. “Maybe a customer has a number of reports on protein expression experiments and lot of clinical data associated with that, including healthcare records and various report spreadsheets, and they must integrate those to do some research for themselves or their internal customers,” he says, as one example. “You can do that even using the Personal version to create a well integrated, semantically formatted file.”

Sure, and when researchers move on, how do their successors maintain those integrations? Inquiring minds want to know. What do we do about semantic rot?

Getting Genetics Done

Wednesday, March 9th, 2011

Getting Genetics Done

Interesting blog site for anyone interested in genetics research and/or data mining issues related to genetics.

If you are looking for a community building exercise, see the Journal club entries.

Wikipedia Page Traffic Statistics Dataset

Thursday, March 3rd, 2011

Wikipedia Page Traffic Statistics Dataset

Data Wrangling reports a 320 GB sample data set of Wikipedia traffic.

Thoughts on similar sample data sets for topic maps?

Sizes, subjects, complexity?

Data Engine Roundup

Wednesday, February 23rd, 2011

Data Engine Roundup

Matthew Hurst provides a quick listing of data engines (including his own, which merits a close look).

80-50 Rule?

Thursday, January 20th, 2011

Watzlawick [1] recounts the following experiment:

That there is no necessary connection between fact and explanation was illustrated in a recent experiment by Bavelas (20): Each subject was told he was participating in an experimental investigation of “concept formation” and was given the same gray, pebbly card about which he was to “formulate concepts.” Of every pair of subjects (seen separately but concurrently) one was told eight out of ten times at random that what he said about the card was correct; the other was told five out of ten times at random that what he said about the card was correct. The ideas of the subject who was “rewarded” with a frequency of 80 per cent remained on a simple level, while the subject who was “rewarded” only at a frequency of 50 per cent evolved complex, subtle, and abstruse theories about the card, taking into consideration the tiniest detail of the card’s composition. When the two subjects were brought together and asked to discuss their findings, the subject with the simpler ideas immediately succumbed to the “brilliance” of the other’s concepts and agreed the other had analyzed the card correctly.

I repeat this account because it illustrates the impact that “reward” systems can have on results.

Whether the “rewards” are members of a crowd or experts.


  1. Should you randomly reject searches in training to search for subjects?
  2. What literature supports your conclusion in #1? (3-5 pages)

This study does raise the interesting question of whether conferences should track and randomly reject authors to encourage innovation.

1. Watzlawick, Paul, Janet Beavin Bavelas, and Don D. Jackson. 1967. Pragmatics of human communication; a study of interactional patterns, pathologies, and paradoxes. New York: Norton.

idk (I Don’t Know)

Sunday, December 5th, 2010

What are you using to act as the placeholder for an unknown player of a role?

That is, in say a news, crime, or accident investigation, there is an association with specified roles, but only some facts are known, not the identities of all the players.

For example, in the recent cablegate case, when the story of the leaks broke, there was clearly an association between the leaked documents and the leaker.

The leaker had a number of known characteristics, not the least of which was ready access to a wide range of documents. I am sure there were others.

To investigate that leak with a topic map, I would want to have a representative for the player of that role, to which I can assign properties.

I started to publish a subject identifier for the subject idk (I Don’t Know) to act as that placeholder but then thought it needs more discussion.

This has been in my blog queue for a couple of weeks so another week or so before creating a subject identifier won’t hurt.

The problem, which you already spotted, is that TMDM-governed topic maps are going to merge topics with the idk (I Don’t Know) subject identifier. Which would be incorrect in many cases.

Interesting that it would not be wrong in all cases. That is, I could have two associations, both of which have idk (I Don’t Know) subject identifiers, and I want them to merge on the basis of other properties. So in that case the subject identifiers should merge.

I am leaning towards simply defining the semantics to be non-merger in the absence of merger on some other specified basis.
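The merging problem described above can be sketched in a few lines. This is a toy illustration of the TMDM rule that topics sharing a subject identifier merge, not conformant topic map software; the `idk` URI and the role properties are made up for the example.

```python
# Sketch (not TMDM-conformant code): topics that share a subject
# identifier are merged, which is why a shared "idk" identifier
# collapses distinct unknown players into one topic.

def merge_topics(topics):
    """Merge topics that share any subject identifier (simplified TMDM rule)."""
    merged = []
    for topic in topics:
        target = next((t for t in merged if t["sids"] & topic["sids"]), None)
        if target:
            target["sids"] |= topic["sids"]
            target["props"].update(topic["props"])  # later values overwrite earlier ones
        else:
            merged.append({"sids": set(topic["sids"]),
                           "props": dict(topic["props"])})
    return merged

IDK = "http://example.org/idk"  # hypothetical subject identifier

leaker = {"sids": {IDK}, "props": {"role": "leaker"}}
burglar = {"sids": {IDK}, "props": {"role": "burglar"}}

# Two unrelated unknowns collapse into a single topic:
result = merge_topics([leaker, burglar])
print(len(result))  # 1 -- the incorrect merge described above
```

Defining idk’s semantics as non-merging would amount to making `merge_topics` treat that one identifier as never matching, absent merger on some other basis.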


PS: I kept writing the expansion idk (I Don’t Know) because a popular search engine suggested Insane Dutch Killers as the expansion. Wanted to avoid any ambiguity.

Classification and Novel Class Detection in Data Streams with Active Mining

Friday, November 12th, 2010

Classification and Novel Class Detection in Data Streams with Active Mining Author(s): Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham


We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
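The selection idea the abstract turns on can be sketched as generic uncertainty sampling: ask for labels only where the classifier is least confident. This is an illustration of the principle, not the authors’ actual ActMiner implementation, and the toy classifier is invented for the example.

```python
# A minimal sketch of the selection idea ActMiner builds on: ask for
# labels only where the classifier is least confident (i.e. where the
# expected classification error is high).

def select_for_labeling(points, predict_proba, budget):
    """Pick the `budget` points whose top-class probability is lowest."""
    scored = []
    for p in points:
        probs = predict_proba(p)
        confidence = max(probs)  # low confidence ~ high expected error
        scored.append((confidence, p))
    scored.sort(key=lambda cp: cp[0])
    return [p for _, p in scored[:budget]]

# Toy two-class classifier on [0, 1] inputs: distance from 0.5 acts as confidence.
def toy_proba(x):
    return [x, 1 - x]

stream = [0.05, 0.48, 0.93, 0.51, 0.2]
print(select_for_labeling(stream, toy_proba, 2))  # -> [0.51, 0.48]
```

With a labeling budget of two, the points nearest the decision boundary are the ones sent to the human expert, which is how a fraction of the labels can train a competitive classifier.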

I would have liked this article better had it not said that the details of the test data could be found in another article.

Specifically: Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B.M.: “Integrating novel class detection with classification for concept-drifting data streams.” In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 79–94. Springer, Heidelberg (2009)

Which directed me to: “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams,”

I leave it as an exercise for the readers to guess the names of the authors of the last paper.

Otherwise interesting research marred by presentation in dribs and drabs.

Now that I have all three papers I will have to see what questions arise, other than questionable publishing practices.

Searching with Tags: Do Tags Help Users Find Things?

Friday, November 12th, 2010

Searching with Tags: Do Tags Help Users Find Things? Authors: Margaret E.I. Kipp and D. Grant Campbell


This study examines the question of whether tags can be useful in the process of information retrieval. Participants searched a social bookmarking tool specialising in academic articles (CiteULike) and an online journal database (Pubmed). Participant actions were captured using screen capture software and they were asked to describe their search process. Users did make use of tags in their search process, as a guide to searching and as hyperlinks to potentially useful articles. However, users also made use of controlled vocabularies in the journal database to locate useful search terms and of links to related articles supplied by the database.

Good review of the literature, such as it is, on the use of user-supplied tagging for searching.

Worth reading on the question raised about the use of tags but there is another question lurking in the background.

The authors say in various forms:

The ability to discover useful resources is of increasing importance where web searches return 300 000 (or more) sites of unknown relevance and is equally important in the realm of digital libraries and article databases. The question of the ability to locate information is an old one and led directly to the creation of cataloguing and classification systems for the organisation of knowledge. However, such systems have not proven to be truly scalable when dealing with digital information and especially information on the web.

Since at least 1/3 of the web is pornography and that is not usually relevant to scientific, technical or medical searching, we can reduce the searching problem by 1/3 right there. I don’t know the percentage for shopping, email archives, etc., but when you come down to the “core” literature for a field, it really isn’t all that large, is it?


  1. Do search applications need to “scale” to web size or just enough to cover “core” literature? (discussion)
  2. For library science, how would you go about constructing a list of the “core” literature? (3-5 pages, no citations)
  3. If you use tagging, describe your experience with assigning tags. (3-5 pages, no citations)
  4. If you use tagging for searching purposes, describe your experience (3-5 pages, no citations)

First seen at: ResourceBlog

Google Refine 2.0 – Announcement

Thursday, November 11th, 2010

Google Refine 2.0 has been released.

From the website:

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and others have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch…[screencasts]

If you don’t watch any other videos this month, watch these!

Google uses the term reconciliation but what is being demonstrated is mapping information to a subject representative.

Note that unlike topic maps, the basis (read properties) for that mapping is not disclosed, so it isn’t possible for a program or person to be sure to repeat the same mapping.
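The contrast drawn above can be made concrete: reconciliation maps a messy record to a canonical subject, and a topic-map style approach would also disclose *which* properties justified the mapping, so the mapping can be repeated. This is an illustrative sketch, not Refine’s actual reconciliation service; the entity IDs, records, and similarity cutoff are made up.

```python
# Sketch: reconcile a messy record to a canonical entity, while
# recording the basis for the match so it can be audited and repeated.

import difflib

CANONICAL = [
    {"id": "ent:chicago", "name": "Chicago", "state": "IL"},
    {"id": "ent:houston", "name": "Houston", "state": "TX"},
]

def reconcile(record, candidates, cutoff=0.6):
    """Return (entity id, disclosed basis) for the best name match, or None."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = difflib.SequenceMatcher(
            None, record["name"].lower(), cand["name"].lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best is None:
        return None
    # The disclosed basis is what an undocumented reconciliation omits:
    basis = {"matched_on": "name", "similarity": round(best_score, 2)}
    return best["id"], basis

print(reconcile({"name": "Chcago"}, CANONICAL))
```

Dropping the `basis` return value gives you the Refine-style behavior the post complains about: a mapping whose grounds neither a program nor a person can verify or reproduce.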

Semantic-Distance Based Clustering for XML Keyword Search

Wednesday, November 10th, 2010

Semantic-Distance Based Clustering for XML Keyword Search Author(s): Weidong Yang, Hao Zhu Keywords: XML, Keyword Search, Clustering


XML Keyword Search is a user-friendly information discovery technique, which is well-suited to schema-free XML documents. We propose a novel scheme for XML keyword search called XKLUSTER, in which a novel semantic-distance model is proposed to specify the set of nodes contained in a result. Based on this model, we use clustering approaches to generate all meaningful results in XML keyword search. A ranking mechanism is also presented to sort the results.

The authors develop an interesting notion of “semantic distance” and then say:

Strictly speaking, the searching intentions of users can never be confirmed accurately; so different than existing researches, we suggest that all keyword nodes are useful more or less and should be included in results. Based on the semantic distance model, we divide the set of keyword nodes X into a group of smaller sets, and each of them is called a “cluster”.

Well…, but the goal is to present the user with results relevant to their query, not results relevant to some query.

Still, an interesting paper and one that XML types will enjoy reading.
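A toy version of the clustering idea may help XML readers see what is being proposed: measure distance between keyword-match nodes as path length through the document tree, then group nodes whose pairwise distances stay under a threshold. This is not XKLUSTER’s actual model; the tree and threshold are invented for illustration.

```python
# Toy semantic-distance clustering over an XML-ish tree: nodes close
# together in the tree (e.g. a title and author of the same book) end
# up in one cluster; distant matches form their own clusters.

parent = {  # child -> parent, for a small hypothetical document tree
    "title1": "book1", "author1": "book1",
    "title2": "book2", "book1": "lib", "book2": "lib",
}

def path(node):
    out = [node]
    while node in parent:
        node = parent[node]
        out.append(node)
    return out

def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    pa, pb = path(a), path(b)
    shared = len(set(pa) & set(pb))
    return (len(pa) - shared) + (len(pb) - shared)

def cluster(nodes, max_dist):
    clusters = []
    for n in nodes:
        for c in clusters:
            if all(tree_distance(n, m) <= max_dist for m in c):
                c.append(n)
                break
        else:
            clusters.append([n])
    return clusters

print(cluster(["title1", "author1", "title2"], max_dist=2))
# -> [['title1', 'author1'], ['title2']]
```

Note that every keyword node lands in some cluster, which is exactly the “all keyword nodes are useful more or less” stance the quote takes, and exactly what the objection above pushes back on.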

Rule Synthesizing from Multiple Related Databases

Tuesday, November 9th, 2010

Rule Synthesizing from Multiple Related Databases Author(s): Dan He, Xindong Wu, Xingquan Zhu Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.
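The first of the two clustering steps the abstract describes, grouping databases by item-level similarity before synthesizing rules, can be sketched as follows. The similarity measure (Jaccard), the threshold, and the store data are illustrative choices, not the paper’s.

```python
# Sketch of item-level database clustering: group databases whose item
# sets overlap enough that synthesizing rules across them makes sense.

def jaccard(a, b):
    """Similarity of two item sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster_by_items(dbs, threshold=0.5):
    clusters = []
    for name, items in dbs.items():
        for c in clusters:
            if all(jaccard(items, dbs[m]) >= threshold for m in c):
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

stores = {
    "grocery_a": {"milk", "bread", "eggs"},
    "grocery_b": {"milk", "bread", "butter"},
    "hardware":  {"nails", "hammer", "tape"},
}
print(cluster_by_items(stores))  # -> [['grocery_a', 'grocery_b'], ['hardware']]
```

The hardware store forming its own cluster is the point of the quote above: pooling its rules with the grocery stores’ rules would manufacture patterns the underlying data does not support.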

Is 303 Really Necessary? – Blog Post

Thursday, November 4th, 2010

Is 303 Really Necessary?.

Ian Davis details at length why 303’s are unnecessary and offers an interesting alternative.

Read the comments as well.

Guessing Explanations?

Tuesday, October 19th, 2010

Our apparent inability to imagine other audiences keeps nagging at me.

If that is true for hierarchical arrangements, then it must be true for indexes as well.

So far, standard topic maps sort of thinking.

What if that applies to explanations as well?

That is, I create better explanations when I imagine the audience to be like me.

And don’t try to guess what others will find a good explanation?

Why not test explanations with audiences?

Make explanation, even of topic maps, a matter of empirical investigation rather than formal correctness.

Using Tag Clouds to Promote Community Awareness in Research Environments

Friday, October 15th, 2010

Using Tag Clouds to Promote Community Awareness in Research Environments Authors: Alexandre Spindler, Stefania Leone, Matthias Geel, Moira C. Norrie Keywords: Tag Clouds – Ambient Information – Community Awareness


Tag clouds have become a popular visualisation scheme for presenting an overview of the content of document collections. We describe how we have adapted tag clouds to provide visual summaries of researchers’ activities and use these to promote awareness within a research group. Each user is associated with a tag cloud that is generated automatically based on the documents that they read and write and is integrated into an ambient information system that we have implemented.

One of the selling points of topic maps has been the serendipitous discovery of new information. Discovery is predicated on awareness and this is an interesting approach to that problem.
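The mechanism as I read the abstract, deriving a user’s tag cloud automatically from the documents they read and write, can be sketched with plain term frequencies. A real system would stem, filter stop words, and decay stale terms; the documents here are invented.

```python
# Minimal sketch of automatic tag cloud generation: tag weights are
# term frequencies across a user's documents.

from collections import Counter

def tag_cloud(documents, top_n=3):
    """Return the top_n (term, weight) pairs across a user's documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts.most_common(top_n)

docs = [
    "topic maps merge subjects",
    "subjects and merging in topic maps",
]
print(tag_cloud(docs))
```

Rendering the weights as font sizes gives the ambient display the paper describes, and watching a colleague’s cloud shift is the awareness effect the questions below probe.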


  1. To what extent does awareness of tagging by colleagues influence future tagging?
  2. How would you design a project to measure the influence of tagging?
  3. Would the influence of tagging change your design of an information interface? Why/Why not? If so, how?

The Linking Open Data cloud diagram

Thursday, September 23rd, 2010

The Linking Open Data cloud diagram is maintained by Richard Cyganiak and Anja Jentzsch.

I suppose having DBpedia at the center of linked data is better than the CIA Factbook. 😉

I find large visualizations like this one useful as marketing tools or “that’s cool” examples, but not terribly useful for actual analysis.

Has your experience been different?

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Sunday, September 19th, 2010

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Destined to be a deeply influential resource.

Read the paper, use the application for a week Chem2Bio2RDF, then answer these questions:

  1. Choose three (3) subjects that are identified in this framework.
  2. For each subject, how is it identified in this framework?
  3. For each subject, have you seen it in another framework or system?
  4. For each subject seen in another framework/system, how was it identified there?

Extra credit: What one thing would you change about any of the identifications in this system? Why?

Experience in Extending Query Engine for Continuous Analytics

Sunday, September 5th, 2010

Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:

Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….

Moving from load-first analyze-later has implications for topic maps over data warehouses. Particularly when events that are subjects may only have a transient existence in a data stream.

This is on my reading list to prepare to discuss TMQL in Leipzig.

PS: Only five days left to register for TMRA 2010. It is a don’t miss event.

Post Early, Post Often

Thursday, August 19th, 2010

Apologies for the lack of a post for August 18, 2010.

I was working on a post late yesterday evening when my ISP lost connectivity to the Net. 🙁

I could not stay up late enough to see if it would be repaired before the end of the day.

Hence, no post for August 18, 2010.

Have a lot of stuff in the queue so will try to get an early post out most days.

PGS – Pretty Good Semantics

Thursday, August 5th, 2010

PGS – Pretty Good Semantics is the result of months of conversation with Sam Hunting.

Our starting premise: Users want to say things of interest to them, as simply as possible, for them.

Note the focus on users. Not on description logic. Not on formal ontologies. Not on reasoning, artificial or otherwise. Not even on complex mappings between identifications. But on users.

All of those other things are worthwhile enterprises, some of them anyway, which you can pursue at your own leisure.

The question is how to empower users to say things about what interests them. And, if possible, how to do so without re-writing the WWW to deal with 303 clouds, etc.?

Our answer to those questions: PGS – Pretty Good Semantics. It asks very little of users yet can annotate any identifier on the WWW to say whatever a user likes.

It uses existing HTML techniques and works with existing web servers and search engines.


Lesson for Topic Maps?

Friday, July 16th, 2010

In an exchange over a MapReduce resource, Robert Barta observed how large that ecosystem has grown in just a year, and suggested there is a lesson for the TM community in that growth. But what lesson is that? (He didn’t say, but I have written to ask.)

“MapReduce” isn’t a cooler name than “Topic Maps” so that’s not the lesson.

MapReduce isn’t less complex than topic maps, so that’s not the lesson either.

Two issues that MapReduce does not face:

  1. Users resisted (and still do resist) markup because it requires making explicit choices about the structure of a text. We learn text structures from users, but for the most part, they are reluctant to name those parts. Is there an analogy to making subjects explicit for a topic map?
  2. If we identify our subjects (our insider vocabulary), then what makes us special will be known by others.

MapReduce doesn’t face the first issue because users can create whatever mapping they wish, without ever saying explicitly what subjects are involved. It also preserves the special nature of insider vocabularies since it has no explicit mechanism for identifying subjects.

Are those the lessons? If they are, are there work arounds? Are there other lessons?

Pragmatic Topic Map Streaming – From Semantic Headache

Tuesday, July 6th, 2010

Pragmatic Topic Map Streaming by Jan Schreiber raises some interesting questions about how to construct a data stream for a topic map.

I particularly like the idea of creating mini-topic maps as it were. See his post for the details.

What he did not touch on was how topic map stream software would recognize subjects. A topic map stream creator with configurable subject recognition would be really useful. Most of us could use the “topic maps subjects” recognition filter while others, interested in dull subjects like the World Cup (just teasing), could have a subject filter for it. Some of us could have both, feeding different topic maps.

iPhone Opportunity for Topic Maps

Sunday, July 4th, 2010

The You Say God Is Dead? There’s an App for That story in the New York Times, July 2, 2010, looks like an opportunity for topic maps.

For publishers, it would be possible to map responses on the basis of topics and let the topic map handle the details of where that is the appropriate response to an “opposing” app. It should shorten the update/production cycle as new material is added to counter new arguments or variations of old ones.

On the product side, publishers could use topic maps to enable users to respond to a variety of ways of naming or phrasing particular issues. In debates over religion, as in all other areas, differences in terminology can make it difficult to come to grips with the opposing side.

Depending on how it was implemented, a topic map app could integrate other resources, ranging from study materials to personal contacts as they relate to this application. Think of a topic map as being able to bridge between data held in mini-silos on an iPhone. So users could add information into the app that was useful to them in such debates.

Any other critical points I should make as I contact publishers of these apps to recommend topic maps?

PS: Did anyone with an iPhone try out tmjs from Jan Schreiber? I really don’t want to have to buy an iPhone just for that. Help me out here.

Constructions from Dots and Lines

Monday, June 14th, 2010

Constructions from Dots and Lines by Marko A. Rodriguez and Peter Neubauer is an engaging introduction to graphs and why they are important.


A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability for a graph to denote objects and their relationships to one another allow for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.
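The dots-and-lines definition above maps directly onto an adjacency list: vertices as dictionary keys, edges as the lists they point to. The package names echo the article’s software-dependency example and are made up for illustration.

```python
# A graph as an adjacency list: vertices (dots) are keys, edges (lines)
# are the entries. Edge u -> v here means "u depends on v".

deps = {
    "webapp":     ["http-lib", "json-lib"],
    "http-lib":   ["socket-lib"],
    "json-lib":   [],
    "socket-lib": [],
}

def transitive_deps(graph, node, seen=None):
    """Everything `node` depends on, directly or indirectly (depth-first)."""
    seen = set() if seen is None else seen
    for neighbor in graph.get(node, []):
        if neighbor not in seen:
            seen.add(neighbor)
            transitive_deps(graph, neighbor, seen)
    return seen

print(sorted(transitive_deps(deps, "webapp")))
# -> ['http-lib', 'json-lib', 'socket-lib']
```

The point of the article is that once something is expressed this way, the whole toolbox of graph algorithms (traversal, shortest paths, centrality) applies to it for free.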

JErlang: Erlang with Joins

Friday, June 11th, 2010

JErlang: Erlang with Joins by Hubert Plociniczak should interest anyone implementing distributed topic map systems.

The value of having a distributed architecture (did I hear “Internet?”) has been lost on the Semantic Web advocates. With topic maps you can have multiple locations that “resolve” identifiers to other identifiers and pass on information about something that has been identified.

Most existing topic maps look like data silos but that is more a matter of habit than architectural limitation.

I should put in a plug for the Springer Alert Service, which brought the article with the same title, JErlang: Erlang with Joins to my attention. Highly recommended as a way to stay current on the latest CS research. Remember articles don’t have to say “topic map” in the title or abstract to be relevant.

PS: Topic map observations: The final report and article have the same name. In topic maps the different locations for the items would be treated as subject locators, thus allowing them to retain the same name while being distinguished one from the other. Note that the roles differ with the two subjects as well. Susan Eisenbach is the supervisor of the final report and a co-author of the article reported by Springer.

The Fourth Paradigm: Data-intensive Scientific Discovery

Wednesday, June 9th, 2010

Jack Park points to The Fourth Paradigm: Data-Intensive Scientific Discovery as a book that merits our attention.

Indeed it does! Lurking just beneath the surface of data-intensive research are questions of semantics. Diverse semantics. How does data-intensive research occur in a multi-semantic world?

Paul Ginsparg (Cornell University), in Text in a Data-centric World, has the usual genuflection towards “linked data” without stopping to consider the cost of evaluating every URI to decide if it is an identifier or a resource. Nor why adding one more name to the welter of names we have now (that is, the semantic diversity problem) is going to make our lives any better.

Ginsparg writes:

Such an articulated semantic structure [linked data] facilitates simpler algorithms acting on World Wide Web text and data and is more feasible in the near term than building a layer of complex artificial intelligence to interpret free-form human ideas using some probabilistic approach.

Solving the “perfect language” problem, which has never been solved, is more feasible than “…building a layer of complex artificial intelligence to interpret free-form human ideas using some probabilistic approach” to solve it for us?

Perhaps so, but one wonders why that is a useful observation.

On the “perfect language” problem, see The Search for the Perfect Language by Umberto Eco.

The Future of the Journal

Saturday, June 5th, 2010

The Future of the Journal is another slide deck by Anita de Waard that reads like a promotional piece for topic maps, sans any mention of topic maps.

While Anita makes a strong case for annotation of data in science publishing, the same is true for government, legal, environmental, business, finance, etc., publications. All publications are as complex as depicted on these slides. It isn’t as obvious in the humanities because that “data” has been locked away so long that we have forgotten it is there.

The more complex the information we record, via “annotations” or some other mechanism, the greater the need for librarians to organize it and help us find it. Self-help in research is like the guy about to do a self-appendectomy with his doctor’s advice over the phone. Doable, maybe, but the results are pretty much what you would expect.

Rather than future of the journal, I would say: Future of Information.