Archive for the ‘Uncategorized’ Category

Developing a Solr Plugin

Saturday, April 27th, 2013

Developing a Solr Plugin by Andrew Janowczyk.

From the post:

For our flagship product, Searchbox.com, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.

There are mainly 4 types of custom plugins which can be created. We’ll discuss their differences here:

Sometimes Andrew says three (3) types of plugins and sometimes he says four (4).

I tried to settle the question by looking at the Solr Wiki on plugins.

Depends on how you want to count separate plugins. ;-)

But, Andrew’s advice about learning to write plugins is sound. It will put your results above those of others.

Operation Asymptote – [PlainSite / Aaron Swartz]

Sunday, January 20th, 2013

Operation Asymptote

Operation Asymptote’s goal is to make U.S. federal court data freely available to everyone.

The data is available now, but free only up to $15 worth every quarter.

Serious legal research hits that limit pretty quickly.

The project does not cost you any money, only some of your time.

The result will be another source of data to hold the system accountable.

So, how real is your commitment to doing something effective in memory of Aaron Swartz?

Cancer, NLP & Kaiser Permanente Southern California (KPSC)

Sunday, August 5th, 2012

Kaiser Permanente Southern California (KPSC) deserves high marks for the research in:

Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm by Justin A Strauss, et al.

Abstract:

Objective Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

Materials and methods SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.

Results Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.

Discussion Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

Conclusion SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.
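For anyone who wants to check the arithmetic, here is a small sketch that pools the counts quoted in the Results paragraph. Two assumptions are mine, not the paper's: primary and recurrent cancers are treated as a single positive class, and the four missed cases are taken to have been classified as benign. The abstract reports its measures per cancer group, so these pooled figures are only a sanity check.

```python
# Counts quoted in the abstract, pooled across the breast and prostate groups.
true_pos = 51 + 60               # cancers SCENT identified (new primary + recurrent)
all_pos = 54 + 61                # cancers identified by the abstractors
false_neg = all_pos - true_pos   # assumption: the missed cases were called benign
false_pos = 3                    # benign cases SCENT flagged as cancer
all_neg = 792                    # true benign cases
true_neg = all_neg - false_pos

sensitivity = true_pos / all_pos
specificity = true_neg / all_neg
ppv = true_pos / (true_pos + false_pos)
npv = true_neg / (true_neg + false_neg)

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("PPV", ppv), ("NPV", npv)]:
    print(f"{name}: {value:.1%}")
```

Under those assumptions the pooled values come out at roughly 96.5%, 99.6%, 97.4%, and 99.5%, which is consistent with the claim that every measure exceeded 94%.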

Before I forget:

Data sharing statement SCENT is freely available for non-commercial use and modification. Program source code and requisite support files may be downloaded from: http://www.kp-scalresearch.org/research/tools_scent.aspx

Topic map promotion point: Application was built to account for linguistic variability, not to stamp it out.

Tools built to fit users are more likely to succeed, don’t you think?

Groves: The Past Is Close Behind

Tuesday, July 3rd, 2012

I was innocently looking for something else when I encountered:

In HyTime ISO/IEC 10744:1997 “3. Definitions (3.35)”: graph representation of property values is ‘An abstract data structure consisting of a directed graph of nodes in which each node may be connected to other nodes by labeled arcs.’ (http://xml.coverpages.org/groves.html)

That sounds like a data structure that a property graph can represent quite easily.

Does it sound that way to you?
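To make the comparison concrete, here is a minimal sketch of the quoted definition as a data structure: nodes connected to other nodes by directed, labeled arcs. The class name and the node and arc labels are invented for illustration and come from neither HyTime nor any particular graph database.

```python
from collections import defaultdict


class LabeledGraph:
    """Directed graph in which every arc carries a label."""

    def __init__(self):
        # source node -> arc label -> list of target nodes
        self._arcs = defaultdict(lambda: defaultdict(list))

    def add_arc(self, source, label, target):
        self._arcs[source][label].append(target)

    def targets(self, source, label):
        """Nodes reachable from `source` over arcs labeled `label`."""
        return list(self._arcs.get(source, {}).get(label, []))


g = LabeledGraph()
g.add_arc("document", "children", "chapter-1")
g.add_arc("document", "children", "chapter-2")
g.add_arc("chapter-1", "title", "Introduction")
print(g.targets("document", "children"))
```

A property graph goes one step further by allowing properties on the nodes and arcs themselves, so the structure above fits inside it without strain.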

Regrets – June 18-19, 2012

Monday, June 18th, 2012

Apologies but I will not be making technical posts to Another Word For It on June 18th or June 19th, 2012.

Medical testing that was supposed to end mid-day on Monday has spread over onto Tuesday. And most of Tuesday at that.

I don’t want to post unless I think the information is useful and/or I have something useful to say about the information. I’m ok but can’t focus enough to promise either one.

On the “bright” side, I hope to return to posting on Wednesday (June 20, 2012) and am only a few posts away from #5,000!

I appreciate well wishes but be aware that I won’t be answering emails during this time period either. I stole a few minutes to make this post.

Infinite Weft (Exploring the Old Aesthetic)

Tuesday, April 10th, 2012

Infinite Weft (Exploring the Old Aesthetic)

Jer Thorp writes:

How can a textile function as a digital object? This is a central question of Infinite Weft, a project that I’ve been working on for a the last few months. The project is a collaboration with my mother, Diane Thorp, who has been weaving for almost 40 years – it’s a chance for me to combine my usually screen-based digital practice with her extraordinary hand-woven work. It’s also an exploration of mathematics, computational history, and the concept of pattern.

Most of us probably know that the loom played a part in the early days of computing – the Jacquard loom was the first machine to use punch cards, and its workings were very influential in the early design of programmable machines (In my 1980s basement this history was actually physically embodied; sitting about 10 feet away from my mother’s two floor looms, on an Ikea bookself, sat a box of IBM punch cards that we mostly used to make paper airplanes out of). But how many of us know how a loom actually works? Though I have watched my mother weave many times, it didn’t take long at the start of this project to realize that I had no real idea how the binary weaving patterns called ‘drawdowns‘ ended up making a pattern in a textile.

[graphic omitted]

To teach myself how this process actually happened, I built a functional software loom, where I could see the pattern manifest itself in the warp and weft (if you have Chrome you can see it in action here – better documentation is coming soon). This gave me a kind of sandbox which let me see how typical weaving patterns were constructed, and what kind of problems I could expect when I started to write my own. And run into problems, I did. My first attempts at generating patterns were sloppy and boring (at best) and the generative methods I was applying weren’t very successful. Enter Ralph E. Griswold.

By this point, “concept of pattern,” “punch cards,” “software loom,” and “Ralph E. Griswold,” I was completely hooked.
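If, like Jer, you have wondered how a binary drawdown comes about, the textbook relationship can be sketched in a few lines. This is not Jer's software loom, only my reading of the standard weaving setup: the threading assigns each warp thread to a shaft, the tie-up says which shafts each treadle lifts, and the treadling lists the treadle pressed for each pass of the weft. The function and variable names are my own.

```python
def drawdown(threading, tieup, treadling):
    """Return rows of 0/1, one row per weft pick.

    threading: shaft number for each warp thread
    tieup:     treadle number -> set of shafts that treadle lifts
    treadling: treadle pressed for each weft pick
    A 1 means the warp thread is lifted, so it shows on top of the weft.
    """
    return [
        [1 if shaft in tieup[treadle] else 0 for shaft in threading]
        for treadle in treadling
    ]


# A 2/2 twill on four shafts, threaded straight.
threading = [1, 2, 3, 4, 1, 2, 3, 4]
tieup = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4, 1}}
treadling = [1, 2, 3, 4]

for row in drawdown(threading, tieup, treadling):
    print("".join("#" if cell else "." for cell in row))
```

The four printed rows are `##..##..`, `.##..##.`, `..##..##` and `#..##..#`: the familiar twill diagonal, stepping one thread to the right with each pick.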

Comments?

Would You Know “Good” XML If It Bit You?

Tuesday, February 14th, 2012

XML is a pale imitation of a markup language. It has resulted in real horrors across the markup landscape. After years in its service, I don’t have much hope of that changing.

But, the Princess of the Northern Marches has organized a war council to consider how to stem the tide of bad XML. Despite my personal misgivings, I wish them well and invite you to participate as you see fit.

Oh, and I found this message about the council meeting:

International Symposium on Quality Assurance and Quality Control in XML

Monday August 6, 2012
Hotel Europa, Montréal, Canada

Paper submissions due April 20, 2012.

A one-day discussion of issues relating to Quality Control and Quality Assurance in the XML environment.

XML systems and software are complex and constantly changing. XML documents are highly varied, may be large or small, and often have complex life-cycles. In this challenging environment quality is difficult to define, measure, or control, yet the justifications for using XML often include promises or implications relating to quality.

We invite papers on all aspects of quality with respect to XML systems, including but not limited to:

  • Defining, measuring, testing, improving, and documenting quality
  • Quality in documents, document models, software, transformations, or queries
  • Case studies in the control of quality in an XML environment
  • Theoretical or practical approaches to measuring quality in XML
  • Does the presence of XML, XML schemas, and XML tools make quality checking easier, harder, or even different from other computing environments
  • Should XML transforms and schemas be QAed as software? Or configuration files? Or documents? Does it matter?

Details at: http://www.balisage.net/QA-QC/

You do have to understand the semantics of even imitation markup languages before mapping them with more robust languages. Enjoy!

Ambiguity in the Cloud

Thursday, December 15th, 2011

If you are interested at all in cloud computing and its adoption, you need to read US Government Cloud Computing Technology Roadmap Volume I Release 1.0 (Draft). I know, a title like that is hardly inviting. But read it anyway. Part of a three volume set, for the other volumes see: NIST Cloud Computing Program.

Would you care to wager on how many out of ten (10) requirements cited a need for interoperability that is presently lacking due to different understandings and terminology, in other words, ambiguity?

Good decision.

The answer? 8 out of 10 requirements cited by NIST have interoperability as a component.

The plan from NIST is to develop a common model, which will be a useful exercise, but how do we discuss differing terminologies until we can arrive at a common one?

Or allow for discussion of previous SLAs, for example, after we have all moved onto a new terminology?

If you are looking for a “hot” topic that could benefit from the application of topic maps (as opposed to choir programs at your local church during the Great Depression) this could be the one. One of those is a demonstration of a commercial grade technology, the other is at best a local access channel offering. You pick which is which.

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

Monday, December 12th, 2011

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

From the post:

Bioinformatics software provider IO Informatics recently released its free Knowledge Explorer Personal Edition. Version 3.6 of the Personal Edition can handle most of what Knowledge Explorer Professional 3.6, launched in October, can, but it does all its work in memory without direct connectivity to a back-end database.

“In particular, a lot of the strengths of Knowledge Explorer have to do with modeling data as RDF and then testing queries, visualizing and browsing the data to see that you have the ontologies and data mappings you need for your integration and application requirements.” says Robert Stanley, IO Informatics president and CEO. The Personal version is aimed at academic experts focused on data integration and semantic data modeling, as well as personal power users in life sciences and other data-intensive industries, or anyone who wants to learn the tool in anticipation of leveraging their enterprise data sets for collaboration and integration projects.

The latest Knowledge Explorer 3.6 feature set extends the thesaurus application in the product, so that users can bring in additional thesauri and vocabularies, as well as the user interaction options for importing, merging and modifying ontologies. For the Pro edition, IO Informatics has also been working with database vendors to increase query speed and loading.

I am not sure what we did collectively to merit presents so early in the holiday season, but I won’t spend a lot of time worrying about it.

I am particularly interested in the “…additional thesauri and vocabularies…” aspect of the software, in part because it isn’t that big a step from there to adding a topic map, which could help provide context and other factors to better enable integration of information.

Oh, and from further down on the webpage:

Stanley sees a number of potential applications for those who might like to try the Personal version for integrating and modeling smaller data sets. “Maybe a customer has a number of reports on protein expression experiments and lot of clinical data associated with that, including healthcare records and various report spreadsheets, and they must integrate those to do some research for themselves or their internal customers,” he says, as one example. “You can do that even using the Personal version to create a well integrated, semantically formatted file.”

Sure, but when researchers move on, how do their successors maintain those integrations? Inquiring minds want to know. What do we do about semantic rot?

Getting Genetics Done

Wednesday, March 9th, 2011

Getting Genetics Done

Interesting blog site for anyone interested in genetics research and/or data mining issues related to genetics.

If you are looking for a community building exercise, see the Journal club entries.

Wikipedia Page Traffic Statistics Dataset

Thursday, March 3rd, 2011

Wikipedia Page Traffic Statistics Dataset

Data Wrangling reports a 320 GB sample data set of Wikipedia traffic.

Thoughts on similar sample data sets for topic maps?

Sizes, subjects, complexity?

Data Engine Roundup

Wednesday, February 23rd, 2011

Data Engine Roundup

Matthew Hurst provides a quick listing of data engines (including his own, which merits a close look).

80-50 Rule?

Thursday, January 20th, 2011

Watzlawick[1] recounts the following experiment:

That there is no necessary connection between fact and explanation was illustrated in a recent experiment by Bavelas (20): Each subject was told he was participating in an experimental investigation of “concept formation” and was given the same gray, pebbly card about which he was to “formulate concepts.” Of every pair of subjects (seen separately but concurrently) one was told eight out of ten times at random that what he said about the card was correct; the other was told five out of ten times at random what he said about the card was correct. The ideas of the subject who was “rewarded” with a frequency of 80 per cent remained on a simple level, while the subject who was “rewarded” only at a frequency of 50 per cent evolved complex, subtle, and abstruse theories about the card, taking into consideration the tiniest detail of the card’s composition. When the two subjects were brought together and asked to discuss their findings, the subject with the simpler ideas immediately succumbed to the “brilliance” of the other’s concepts and agreed the other had analyzed the card correctly.

I repeat this account because it illustrates the impact that “reward” systems can have on results.

That holds whether the “rewards” come from members of a crowd or from experts.

Questions:

  1. Should you randomly reject searches in training to search for subjects?
  2. What literature supports your conclusion in #1? (3-5 pages)

This study does raise the interesting question of whether conferences should track and randomly reject authors to encourage innovation.

1. Watzlawick, Paul, Janet Beavin Bavelas, and Don D. Jackson. 1967. Pragmatics of human communication; a study of interactional patterns, pathologies, and paradoxes. New York: Norton.

idk (I Don’t Know)

Sunday, December 5th, 2010

What are you using to act as the placeholder for an unknown player of a role?

That is, in say a news, crime or accident investigation, there is an association with specified roles, but only some facts are known, not the identity of all the players.

For example, in the recent cablegate case, when the story of the leaks broke, there was clearly an association between the leaked documents and the leaker.

The leaker had a number of known characteristics, the least of which was ready access to a wide range of documents. I am sure there were others.

To investigate that leak with a topic map, I would want to have a representative for the player of that role, to which I can assign properties.

I started to publish a subject identifier for the subject idk (I Don’t Know) to act as that placeholder but then thought it needs more discussion.

This has been in my blog queue for a couple of weeks so another week or so before creating a subject identifier won’t hurt.

The problem, which you already spotted, is that TMDM governed topic maps are going to merge topics with the idk (I Don’t Know) subject identifier. Which would be incorrect in many cases.

Interesting that it would not be wrong in all cases. That is, I could have two associations, both of which have idk (I Don’t Know) subject identifiers, and I want them to merge on the basis of other properties. So in that case the subject identifiers should merge.

I am leaning towards simply defining the semantics to be non-merger in the absence of merger on some other specified basis.
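A rough sketch of that leaning, to give the discussion something to push against. Everything here is hypothetical: the identifier URI is a placeholder rather than a published subject identifier, topics are plain dictionaries, and the "other specified basis" is a rule the map author supplies. It deliberately departs from default TMDM behavior, which merges on any shared subject identifier.

```python
IDK = "http://example.org/psi/idk"  # hypothetical placeholder, not a published identifier


def should_merge(a, b, same_subject=None):
    """Decide whether two topics should merge.

    a, b:         dicts with an "identifiers" set and a "properties" dict
    same_subject: optional rule comparing properties (the other specified basis)
    """
    shared = (a["identifiers"] & b["identifiers"]) - {IDK}
    if shared:
        return True   # ordinary merging on a shared subject identifier
    if same_subject is not None:
        return same_subject(a["properties"], b["properties"])
    return False      # sharing only idk is never enough


leaker_a = {"identifiers": {IDK}, "properties": {"case": "cablegate"}}
leaker_b = {"identifiers": {IDK}, "properties": {"case": "cablegate"}}
burglar = {"identifiers": {IDK}, "properties": {"case": "warehouse break-in"}}


def same_case(p, q):
    return p.get("case") == q.get("case")


print(should_merge(leaker_a, burglar))
print(should_merge(leaker_a, leaker_b, same_case))
```

With no other basis the two unknowns stay apart; with the `same_case` rule the two cablegate placeholders merge and the burglar still does not.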

Suggestions?

PS: I kept writing the expansion idk (I Don’t Know) because a popular search engine suggested Insane Dutch Killers as the expansion. Wanted to avoid any ambiguity.

Classification and Novel Class Detection in Data Streams with Active Mining

Friday, November 12th, 2010

Classification and Novel Class Detection in Data Streams with Active Mining Author(s): Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham

Abstract:

We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.

I would have liked this article better had it not said that the details of the test data could be found in another article.

Specifically: Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B.M.: “Integrating novel class detection with classification for concept-drifting data streams.” In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 79–94. Springer, Heidelberg (2009)

Which directed me to: “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams,” http://www.utdallas.edu/~mmm058000/reports/UTDCS-13-09.pdf

I leave it as an exercise for the readers to guess the names of the authors of the last paper.

Otherwise interesting research marred by presentation in dribs and drabs.

Now that I have all three papers I will have to see what questions arise, other than questionable publishing practices.

Searching with Tags: Do Tags Help Users Find Things?

Friday, November 12th, 2010

Searching with Tags: Do Tags Help Users Find Things? Authors: Margaret E.I. Kipp and D. Grant Campbell

Abstract:

This study examines the question of whether tags can be useful in the process of information retrieval. Participants searched a social bookmarking tool specialising in academic articles (CiteULike) and an online journal database (Pubmed). Participant actions were captured using screen capture software and they were asked to describe their search process. Users did make use of tags in their search process, as a guide to searching and as hyperlinks to potentially useful articles. However, users also made use of controlled vocabularies in the journal database to locate useful search terms and of links to related articles supplied by the database.

Good review of the literature, such as it is, on use of user supplied tagging for searching.

Worth reading on the question raised about the use of tags but there is another question lurking in the background.

The authors say in various forms:

The ability to discover useful resources is of increasing importance where web searches return 300 000 (or more) sites of unknown relevance and is equally important in the realm of digital libraries and article databases. The question of the ability to locate information is an old one and led directly to the creation of cataloguing and classification systems for the organisation of knowledge. However, such systems have not proven to be truly scalable when dealing with digital information and especially information on the web.

Since at least 1/3 of the web is pornography and that is not usually relevant to scientific, technical or medical searching, we can reduce the searching problem by 1/3 right there. I don’t know the percentage for shopping, email archives, etc., but when you come down to the “core” literature for a field, it really isn’t all that large, is it?

Questions:

  1. Do search applications need to “scale” to web size or just enough to cover “core” literature? (discussion)
  2. For library science, how would you go about constructing a list of the “core” literature? (3-5 pages, no citations)
  3. If you use tagging, describe your experience with assigning tags. (3-5 pages, no citations)
  4. If you use tagging for searching purposes, describe your experience. (3-5 pages, no citations)

First seen at: ResourceBlog

Google Refine 2.0 – Announcement

Thursday, November 11th, 2010

Google Refine 2.0 has been released.

From the website:

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and data.gov.uk have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch…[screencasts]

If you don’t watch any videos this month, you have to watch http://www.youtube.com/watch?v=m5ER2qRH1OQ!

Google uses the term reconciliation but what is being demonstrated is mapping information to a subject representative.

Note that unlike topic maps, the basis (read properties) for that mapping is not disclosed, so it isn’t possible for a program or person to be sure to repeat the same mapping.
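To illustrate what disclosing the basis could look like, here is a small sketch in which every match carries the properties it was made on. This is not how Google Refine works internally; the function, the record fields and the candidate ids are all invented for the example.

```python
def reconcile(record, candidates, keys):
    """Return the first candidate agreeing with `record` on every key,
    together with the basis (the property values) used for the match."""
    for candidate in candidates:
        if all(k in record and record[k] == candidate.get(k) for k in keys):
            return {
                "match": candidate["id"],
                "basis": {k: record[k] for k in keys},
            }
    return None


record = {"name": "Springfield", "state": "Illinois"}
candidates = [
    {"id": "ex:1", "name": "Springfield", "state": "Missouri"},
    {"id": "ex:2", "name": "Springfield", "state": "Illinois"},
]
print(reconcile(record, candidates, ["name", "state"]))
```

Because the result states that the mapping rests on `name` and `state`, another program or person can repeat it or dispute it.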

Semantic-Distance Based Clustering for XML Keyword Search

Wednesday, November 10th, 2010

Semantic-Distance Based Clustering for XML Keyword Search Author(s): Weidong Yang, Hao Zhu Keywords: XML, Keyword Search, Clustering

Abstract:

XML Keyword Search is a user-friendly information discovery technique, which is well-suited to schema-free XML documents. We propose a novel scheme for XML keyword search called XKLUSTER, in which a novel semantic-distance model is proposed to specify the set of nodes contained in a result. Based on this model, we use clustering approaches to generate all meaningful results in XML keyword search. A ranking mechanism is also presented to sort the results.

The authors develop an interesting notion of “semantic distance” and then say:

Strictly speaking, the searching intentions of users can never be confirmed accurately; so different than existing researches, we suggest that all keyword nodes are useful more or less and should be included in results. Based on the semantic distance model, we divide the set of keyword nodes X into a group of smaller sets, and each of them is called a “cluster”.

Well…, but the goal is to present the user with results relevant to their query, not results relevant to some query.

Still, an interesting paper and one that XML types will enjoy reading.

Rule Synthesizing from Multiple Related Databases

Tuesday, November 9th, 2010

Rule Synthesizing from Multiple Related Databases Author(s): Dan He, Xindong Wu, Xingquan Zhu Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.

Is 303 Really Necessary? – Blog Post

Thursday, November 4th, 2010

Is 303 Really Necessary?

Ian Davis details at length why 303s are unnecessary and offers an interesting alternative.

Read the comments as well.
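For anyone who has not met the practice under discussion: in the linked data convention, a URI that names a thing rather than a document answers with HTTP 303 See Other and a Location header pointing at a document about the thing. A minimal sketch follows. The `/id/` and `/doc/` path layout and the function name are illustrative assumptions of mine, not anything taken from the post.

```python
def respond(path):
    """Return (status, headers) for a request path.

    Assumed layout: /id/<name> names a thing, /doc/<name> is a document about it.
    """
    if path.startswith("/id/"):
        name = path[len("/id/"):]
        # 303 See Other: a description of the thing lives at a different URI.
        return 303, {"Location": "/doc/" + name}
    if path.startswith("/doc/"):
        return 200, {"Content-Type": "text/html"}
    return 404, {}


print(respond("/id/alice"))
print(respond("/doc/alice"))
```

Note that every lookup of a thing costs two round trips, one for the redirect and one for the document.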

Guessing Explanations?

Tuesday, October 19th, 2010

Our apparent inability to imagine other audiences keeps nagging at me.

If that is true for hierarchical arrangements, then it must be true for indexes as well.

So far, standard topic maps sort of thinking.

What if that applies to explanations as well?

That is, I create better explanations when I imagine the audience to be like me.

And don’t try to guess what others will find a good explanation?

Why not test explanations with audiences?

Make explanation, even of topic maps, a matter of empirical investigation rather than formal correctness.

Using Tag Clouds to Promote Community Awareness in Research Environments

Friday, October 15th, 2010

Using Tag Clouds to Promote Community Awareness in Research Environments Authors: Alexandre Spindler, Stefania Leone, Matthias Geel, Moira C. Norrie Keywords: Tag Clouds – Ambient Information – Community Awareness

Abstract:

Tag clouds have become a popular visualisation scheme for presenting an overview of the content of document collections. We describe how we have adapted tag clouds to provide visual summaries of researchers’ activities and use these to promote awareness within a research group. Each user is associated with a tag cloud that is generated automatically based on the documents that they read and write and is integrated into an ambient information system that we have implemented.

One of the selling points of topic maps has been the serendipitous discovery of new information. Discovery is predicated on awareness and this is an interesting approach to that problem.
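The mechanics of a tag cloud are simple enough to sketch: count how often each tag occurs across a user's documents and scale a font size to the count. This is the usual convention, not the authors' implementation; the function name, the size range and the sample tags are my own assumptions.

```python
from collections import Counter


def tag_cloud(documents, min_size=10, max_size=32):
    """Map each tag to a font size that grows with how often it occurs.

    documents: iterable of tag lists, one list per document read or written.
    """
    counts = Counter(tag for tags in documents for tag in tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1   # avoid dividing by zero when all counts are equal
    return {
        tag: min_size + (count - low) * (max_size - min_size) // span
        for tag, count in counts.items()
    }


docs = [["topic maps", "xml"], ["topic maps", "rdf"], ["topic maps", "xml"]]
print(tag_cloud(docs))
```

Here "topic maps" gets the largest size (32), "xml" lands in the middle (21) and "rdf" gets the smallest (10).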

Questions:

  1. To what extent does awareness of tagging by colleagues influence future tagging?
  2. How would you design a project to measure the influence of tagging?
  3. Would the influence of tagging change your design of an information interface? Why/Why not? If so, how?

The Linking Open Data cloud diagram

Thursday, September 23rd, 2010

The Linking Open Data cloud diagram is maintained by Richard Cyganiak and Anja Jentzsch.

I suppose having DBpedia at the center of linked data is better than the CIA Factbook. ;-)

I find large visualizations like this one useful as marketing tools or “that’s cool” examples, but not terribly useful for actual analysis.

Has your experience been different?

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Sunday, September 19th, 2010

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Destined to be a deeply influential resource.

Read the paper, use the application (Chem2Bio2RDF) for a week, then answer these questions:

  1. Choose three (3) subjects that are identified in this framework.
  2. For each subject, how is it identified in this framework?
  3. For each subject, have you seen it in another framework or system?
  4. For each subject seen in another framework/system, how was it identified there?

Extra credit: What one thing would you change about any of the identifications in this system? Why?

Experience in Extending Query Engine for Continuous Analytics

Sunday, September 5th, 2010

Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:

Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….

Moving from load-first analyze-later has implications for topic maps over data warehouses. Particularly when events that are subjects may only have a transient existence in a data stream.
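A toy sketch of what transient existence means in practice: the counts below cover only the most recent events, so a subject disappears from view once it slides out of the window. It is a generic sliding window in plain Python, not the query engine extension the paper describes, and the names are my own.

```python
from collections import Counter, deque


def windowed_counts(events, window=3):
    """Yield subject counts over the most recent `window` events."""
    recent = deque()
    counts = Counter()
    for subject in events:
        recent.append(subject)
        counts[subject] += 1
        if len(recent) > window:
            expired = recent.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]   # the subject has left the stream's view
        yield dict(counts)


for snapshot in windowed_counts(["a", "b", "a", "c", "c"]):
    print(snapshot)
```

By the last snapshot "b" is gone entirely. Nothing was loaded first and analyzed later, so there is no stored record of "b" to go back to.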

This is on my reading list to prepare to discuss TMQL in Leipzig.

PS: Only five days left to register for TMRA 2010. It is a don’t miss event.

Post Early, Post Often

Thursday, August 19th, 2010

Apologies for the lack of a post for August 18, 2010.

I was working on a post late yesterday evening when my ISP lost connectivity to the Net. :-(

I could not stay up late enough to see if it would be repaired before the end of the day.

Hence, no post for August 18, 2010.

Have a lot of stuff in the queue so will try to get an early post out most days.

PGS – Pretty Good Semantics

Thursday, August 5th, 2010

PGS – Pretty Good Semantics is the result of months of conversation with Sam Hunting.

Our starting premise: Users want to say things of interest to them, as simply as possible, for them.

Note the focus on users. Not on description logic. Not on formal ontologies. Not on reasoning, artificial or otherwise. Not even on complex mappings between identifications. But on users.

All of those other things are worthwhile enterprises, some of them anyway, which you can pursue at your own leisure.

The question is: how do we empower users to say things about what interests them? And, if possible, how do we do so without re-writing the WWW to deal with 303 clouds, etc.?

Our answer to those questions: PGS – Pretty Good Semantics. It asks very little of users yet can annotate any identifier on the WWW to say whatever a user likes.

It uses existing HTML techniques and works with existing web servers and search engines.

Enjoy!

Lesson for Topic Maps?

Friday, July 16th, 2010

In an exchange over a MapReduce resource, Robert Barta observed how large that ecosystem has grown in just a year, and suggested there is a lesson for the TM community in that growth. But what lesson is that? (He didn’t say, but I have written to ask.)

“MapReduce” isn’t a cooler name than “Topic Maps” so that’s not the lesson.

MapReduce isn’t less complex than topic maps so that’s not the lesson either.

Two issues that MapReduce does not face:

  1. Users resisted (and still do resist) markup because it requires making explicit choices about the structure of a text. We learn text structures from users, but for the most part, they are reluctant to name those parts. Is there an analogy to making subjects explicit for a topic map?
  2. If we identify our subjects (our insider vocabulary), then what makes us special will be known by others.

MapReduce doesn’t face the first issue because users can create whatever mapping they wish, without ever saying explicitly what subjects are involved. It also preserves the special nature of insider vocabularies since it has no explicit mechanism for identifying subjects.

Are those the lessons? If they are, are there workarounds? Are there other lessons?

Pragmatic Topic Map Streaming – From Semantic Headache

Tuesday, July 6th, 2010

Pragmatic Topic Map Streaming by Jan Schreiber raises some interesting questions about how to construct a data stream for a topic map.

I particularly like the idea of creating mini-topic maps as it were. See his post for the details.

What he did not touch on was how topic map stream software would recognize subjects. A topic map stream creator with configurable subject recognition would be really useful. Most of us could use the “topic maps subjects” recognition filter while others, interested in dull subjects like the World Cup (just teasing), could have a subject filter for it. Some of us could have both, feeding different topic maps.
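To make the idea of configurable subject recognition a little more concrete, here is a minimal sketch in Python. It is not taken from Jan's post or from any topic map software. The names `keyword_recognizer` and `route_stream`, the keyword lists, and the sample stream items are all hypothetical, and plain keyword matching stands in for whatever subject recognition a real stream creator would use.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical: a "recognizer" is any function that decides whether a
# stream item is about a given subject.
Recognizer = Callable[[str], bool]


def keyword_recognizer(keywords: Iterable[str]) -> Recognizer:
    """Build a recognizer that matches any of the keywords, ignoring case."""
    lowered = [k.lower() for k in keywords]

    def recognize(item: str) -> bool:
        text = item.lower()
        return any(k in text for k in lowered)

    return recognize


def route_stream(
    items: Iterable[str], filters: Dict[str, Recognizer]
) -> Dict[str, List[str]]:
    """Send each item to every topic map whose filter recognizes it."""
    routed: Dict[str, List[str]] = {name: [] for name in filters}
    for item in items:
        for name, recognize in filters.items():
            if recognize(item):
                routed[name].append(item)
    return routed


# Two assumed filters, each feeding a different topic map.
filters = {
    "topic-maps": keyword_recognizer(["topic map", "TMQL"]),
    "world-cup": keyword_recognizer(["world cup"]),
}
stream = [
    "New TMQL draft posted",
    "World Cup final on Sunday",
    "Topic map of World Cup squads",
]
result = route_stream(stream, filters)
```

An item recognized by both filters lands in both topic maps, which is the "some of us could have both" case.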

iPhone Opportunity for Topic Maps

Sunday, July 4th, 2010

The You Say God Is Dead? There’s an App for That story in the New York Times, July 2, 2010, looks like an opportunity for topic maps.

For publishers, it would be possible to map responses on the basis of topics and let the topic map handle the details of when a given response is the appropriate answer to an “opposing” app. It should shorten the update/production cycle as new material is added to counter new arguments or variations of old ones.

On the product side, publishers could use topic maps to enable users to respond to a variety of ways of naming or phrasing particular issues. In debates over religion, as in all other areas, differences in terminology can make it difficult to come to grips with the opposing side.
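As a rough illustration of handling several names for one issue, the sketch below maps each name to a single subject and looks up the response by subject. It is only a sketch under assumptions: the subject key `issue-1`, the names, the response text, and the function `find_response` are all hypothetical placeholders, and exact matching on a normalized phrase is a stand-in for real topic map merging.

```python
from typing import Dict, Optional

# Hypothetical data: several names that identify the same subject.
names_to_subject: Dict[str, str] = {
    "first name for the issue": "issue-1",
    "second name for the issue": "issue-1",
    "a third phrasing": "issue-1",
}

# Hypothetical data: one response per subject, not one per name.
responses: Dict[str, str] = {
    "issue-1": "Response to issue 1",
}


def find_response(phrase: str) -> Optional[str]:
    """Return the response for whatever name the user typed, if known."""
    subject = names_to_subject.get(phrase.strip().lower())
    if subject is None:
        return None
    return responses.get(subject)
```

Adding a new way of naming an issue then means adding one entry to the name mapping, while the response itself stays in one place.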

Depending on how it was implemented, a topic map app could integrate other resources, ranging from study materials to personal contacts, as they relate to this application. Think of a topic map as being able to bridge between data held in mini-silos on an iPhone. Users could then add information to the app that was useful to them in such debates.

Any other critical points I should make as I contact publishers of these apps to recommend topic maps?

*****
PS: Did anyone with an iPhone try out tmjs from Jan Schreiber? I really don’t want to have to buy an iPhone just for that. Help me out here.