Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 3, 2011

The Dow Piano

Filed under: Data Mining,Humor — Patrick Durusau @ 5:32 pm

The Dow Piano

A representation of a year of the Dow Jones Industrial Average as a graph and as musical notes.

At first I was going to post it as something too bizarre to pass up.

But, then it occurred to me that representing data in unexpected ways, such as musical notes, could be an interesting way to explore data.

I am not promising that you will find anything based upon converting stock trades from the various houses into musical notes. But you won’t know unless you look.
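
If you want to experiment, the conversion need not be fancy. Below is a minimal sketch in Python of one way to do it: scale closing prices onto a pentatonic scale and emit MIDI note numbers. The closing prices are made-up illustration data, and the mapping is my own guess at the general idea, not the Dow Piano’s actual method.

```python
# Minimal sketch: map daily closing prices onto a pentatonic scale.
# The closes below are invented illustration data, not actual Dow values.

PENTATONIC = [0, 2, 4, 7, 9]  # scale degrees within one octave

def price_to_midi(close, lo, hi, base_note=60, octaves=2):
    """Scale a price within [lo, hi] and pick a pentatonic MIDI note."""
    span = octaves * len(PENTATONIC)               # total notes available
    step = int((close - lo) / (hi - lo + 1e-9) * (span - 1))
    octave, degree = divmod(step, len(PENTATONIC))
    return base_note + 12 * octave + PENTATONIC[degree]

closes = [10428.05, 10572.02, 10067.33, 10325.26, 10856.63]
lo, hi = min(closes), max(closes)
for day, close in enumerate(closes, 1):
    print(f"day {day}: close={close:.2f} -> MIDI note {price_to_midi(close, lo, hi)}")
```

Feed the note numbers to any MIDI library or synthesizer and you can listen for patterns instead of squinting at a chart.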

What will be amusing is that if and when patterns are found, it will be like the rabbit/duck illusion: once a pattern is seen, it won’t be possible to look at the data without seeing it.

Vowpal Wabbit (Fast Online Learning)

Filed under: Machine Learning,Subject Identity — Patrick Durusau @ 4:01 pm

Vowpal Wabbit (Fast Online Learning) by John Langford.

From the website:

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation.

I rather like that!

I suspect the same is true for subject identity recognition algorithms.

People have fast ones that require little or no programming. 😉

What it will take to replicate such intrinsically fast subject recognition algorithms in digital form remains a research question.
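
To make the “intrinsically fast” point concrete, here is a toy sketch of single-pass online learning with hashed features, loosely in the spirit of (though vastly simpler than) Vowpal Wabbit. The data stream and feature names are invented for the example.

```python
# Toy online logistic regression with feature hashing: one pass, constant
# memory, no batch phase. Loosely VW-flavored, not VW's actual algorithm.

import math

DIM = 2 ** 16                      # size of the hashed weight vector
weights = [0.0] * DIM
LR = 0.1                           # learning rate

def predict(features):
    # Note: Python's hash() is randomized per process; a real system
    # would use a stable hash function instead.
    z = sum(weights[hash(f) % DIM] for f in features)
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def update(features, label):
    """One online step: predict, then nudge the touched weights."""
    g = predict(features) - label  # gradient of log-loss w.r.t. the margin
    for f in features:
        weights[hash(f) % DIM] -= LR * g

stream = [(["word:fast", "word:learning"], 1),
          (["word:slow", "word:batch"], 0)]
for features, label in stream:     # a single pass over the stream
    update(features, label)

print(predict(["word:fast"]))      # above 0.5 after the pass
```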

The Haskell Road to Logic, Math and Programming [pdf]

Filed under: Haskell,Logic — Patrick Durusau @ 3:18 pm

The Haskell Road to Logic, Math and Programming [pdf]. Authors: Kees Doets and Jan van Eijck.

A detailed review can be found at: Book review “The Haskell Road to Logic, Maths and Programming” by Ralf Laemmel.

There are so many “cell phone dead zones,” as Newcomb puts it, when dealing with semantics that any assistance in clear thinking is welcome.

This is a work that promotes clear thinking.

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation – Post

Filed under: Duplicates,Natural Language Processing,Similarity,String Matching — Patrick Durusau @ 3:03 pm

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

Good coverage of tokenization of tweets and the use of the Jaccard Distance measure to determine similarity.

Of course, for a topic map, similarity may not lead to being discarded but trigger other operations instead.
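
For the curious, a minimal sketch of the Jaccard comparison (my own simplification; LingPipe’s tokenization of Twitter text is considerably more careful, and the threshold here is an arbitrary choice for illustration):

```python
# Near-duplicate detection with Jaccard similarity over token sets.
# Tweets and threshold are invented for illustration.

def tokens(tweet):
    return set(tweet.lower().split())

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; distance is 1 minus this."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

tweets = ["RT check out the Dow Piano",
          "check out the dow piano",
          "vowpal wabbit is fast"]

THRESHOLD = 0.6
for i, t1 in enumerate(tweets):
    for t2 in tweets[i + 1:]:
        sim = jaccard(tokens(t1), tokens(t2))
        if sim >= THRESHOLD:
            # In a topic map setting this need not mean "discard":
            # it could instead trigger a merge of the two topics.
            print(f"near-duplicate (sim={sim:.2f}): {t1!r} / {t2!r}")
```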

Zotero – Software

Filed under: Bibliography,Marketing,Software — Patrick Durusau @ 2:55 pm

Zotero

I don’t remember now how I stumbled across this interesting project.

Looks like fertile ground for the discussion of subject identity.

Particularly since shared bibliographies are nice but merged bibliographies would be better.

Drop in, introduce yourself, and introduce topic map thinking about subject identity.

Modeling Social Annotation: A Bayesian Approach

Filed under: Bayesian Models,Data Mining,Tagging — Patrick Durusau @ 2:46 pm

Modeling Social Annotation: A Bayesian Approach. Authors: Anon Plangprasopchok, Kristina Lerman.

Abstract:

Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, for example, Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users can potentially be used to infer categorical knowledge, classify documents, or recommend new relevant information. Traditional text inference methods do not make the best use of social annotation, since they do not take into account variations in individual users’ perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on the synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.
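
For a feel of the LDA baseline the authors compare against, here is a hedged sketch using scikit-learn (my tooling choice, not the paper’s). Each “document” is the bag of tags a resource received; the tag sets are invented. Note that plain LDA needs the number of topics fixed up front, which is exactly the shortcoming the fully Bayesian model is meant to remove.

```python
# Sketch of the LDA baseline: infer topics from bags of tags.
# Tag sets are invented; real input would come from Delicious annotations.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

resources = ["python programming tutorial code",
             "haskell functional programming logic",
             "recipes cooking baking bread",
             "bread sourdough baking flour"]

vec = CountVectorizer()
counts = vec.fit_transform(resources)

# n_components must be chosen a priori -- the limitation the paper addresses.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
```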

Questions:

  1. How does a tagging vocabulary differ (if it does) from a regular vocabulary? (3-5 pages, no citations)
  2. Would this technique be applicable to tracing vocabulary usage across cited papers? In other words, following an author backwards through materials they cite? (3-5 pages, no citations)
  3. What other characteristics do you think a paper would have if the usage of a term in it had shifted to a different meaning? (3-5 pages, no citations)

January 2, 2011

Why We Desperately Need a New (and Better) Google – Post

Filed under: Search Engines — Patrick Durusau @ 2:23 pm

Why We Desperately Need a New (and Better) Google

Vivek Wadhwa compares Blekko with Google.

The comments on the post, some of them anyway, were as interesting as the post itself.

Questions:

  1. What should Google (or any other search engine) do better in your opinion? (3-5 pages, no citations)
  2. Should new search engines re-index the Internet, or target sub-parts of the Internet? (3-5 pages, no citations)
  3. The unicorn of the Internet, some obscure site with relevant information, gets mentioned. How serious is the requirement to find every relevant site? If results past the first 100 go unexamined, what does it matter? (3-5 pages, no citations)
  4. Which would you find more useful: 10 articles relevant to your search topic or > 100 search engine results? (3-5 pages, no citations)

Computational semantics with functional programming

Filed under: Computational Semantics — Patrick Durusau @ 2:21 pm

Computational semantics with functional programming. Authors: J van Eijck and Christina Unger. Cambridge University Press, 2010.

I ran across the reference to this volume on J van Eijck’s home page. There are a number of interesting publications listed there.

If you have read this volume, please post your comments and send me a link.

January 1, 2011

Happy New Year! – Wikileaks

Filed under: Marketing,Topic Maps — Patrick Durusau @ 9:11 pm

Wikileaks has continued posting US diplomatic cables.

The stories read like Extra but with less attractive people.

According to Sec. of State Hillary Clinton, one counterpart said of Wikileaks’ posting of US diplomatic cables:

Well, don’t worry about it, you should see what we say about you

It’s not clear if they meant Hillary or the United States. 😉

But I’m curious either way. Let’s see diplomatic cables from other countries.

Imagine the Guardian map with cables from multiple countries?

Or mapping relationships between the various people and current government figures?

Or mapping those relationships over careers of toleration or even encouragement of repressive or brutal regimes?

Disclosure of diplomatic cables should be encouraged.

For the same reason disclosure is feared. Accountability.

Such as judging why the United States and other countries have tolerated post-WWII episodes of ethnic cleansing.

*****
PS: This would make good subject matter for a public authoring interface to invite contributions from others.

Lots of Copies – Keeps Stuff Safe / Is Insecure – LOCKSS / LOCII

Filed under: Data Reduction,Preservation — Patrick Durusau @ 8:46 pm

Is your enterprise accidentally practicing Lots of Copies Keeps Stuff Safe – LOCKSS?

Gartner analyst Drue Reeves says:

Use document management to make sure you don’t have copies everywhere, and purge nonrelevant material.

If you fall into the lots of copies category, your slogan should be: Lots of Copies Is Insecure, or LOCII (pronounced “lossee”).

Not all document preservation solutions depend upon being insecure.

Topic maps can help develop strategies to make your document management solution less LOCII.

One way they can help is by mapping out all the duplicate copies. Are they really necessary?

Another way they can help is by showing who has access to each of those copies.

If you trust someone with access, that means you trust everyone they trust.

Check their Facebook or LinkedIn pages to see how many other people you are trusting, just by trusting the first person.

Ask yourself: How bad would a Wikileaks-like disclosure be?

Then get serious about information security and topic maps.
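
Getting serious can start small. Here is a minimal sketch of the duplicate-copy mapping mentioned above, assuming you can walk the relevant file shares (the directory roots are placeholders). In a topic map, each digest would become a subject, with every copy, and who can read it, attached as occurrences.

```python
# Map duplicate copies across shares by content hash.
# Directory roots below are placeholders for real enterprise shares.

import hashlib
from collections import defaultdict
from pathlib import Path

def sha256(path, chunk=1 << 20):
    """Hash a file in 1 MB chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

copies = defaultdict(list)
for root in ["/shared/finance", "/shared/legal"]:    # placeholder roots
    for p in Path(root).rglob("*"):
        if p.is_file():
            copies[sha256(p)].append(str(p))

for digest, paths in copies.items():
    if len(paths) > 1:
        print(f"{len(paths)} copies of {digest[:12]}...: {paths}")
```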

Data Reduction Technologies: What’s the Difference? – Podcast

Filed under: Data Reduction — Patrick Durusau @ 1:06 pm

Data Reduction Technologies: What’s the Difference?

A podcast I found at searchcio.com on data reduction, otherwise known as data deduplication.

You wouldn’t think that incremental backups would be news, but apparently they are. The notion of making a backup incremental after it was a full backup (post-processing in HP speak) was new to me, but backup technology isn’t one of my primary concerns.

Or hasn’t been, I suppose I should say.

Incremental backups and de-duping are well explored and probably need no help from topic maps. At least not per se.

The thought occurs to me that topic maps could assist in mapping advanced data reduction that doesn’t simply eliminate duplicate data within a single backup, but eliminates duplicate data across an enterprise.

Now that would be something different and perhaps a competitive advantage for any vendor wanting to pursue it.

Just in sketch form: create a topic map of the various backups and map the duplicate data across them. Then a pointer is created to the master backup that holds the duplicate data. To a client, the data would appear to exist as part of its own backup.
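
In Python, the sketch might look like this, with invented backup contents and an unrealistically small block size. The store dictionary plays the role of the map: one master copy per block, pointers everywhere else.

```python
# De-duplicate blocks across backups: keep one master copy per block,
# leave content-hash pointers behind. Contents and block size are invented.

import hashlib

backups = {
    "backup_mon": b"AAAABBBBCCCC",
    "backup_tue": b"AAAABBBBDDDD",   # shares two blocks with Monday
}
BLOCK = 4                            # toy block size; real systems use KBs+

store = {}    # digest -> (master backup, offset, block): the "map"
deduped = {}  # backup -> ordered list of digests (pointers only)

for name, data in backups.items():
    pointers = []
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, (name, off, block))  # first copy is master
        pointers.append(digest)
    deduped[name] = pointers

def restore(name):
    """A client sees a full backup even though shared blocks live elsewhere."""
    return b"".join(store[d][2] for d in deduped[name])

assert restore("backup_tue") == backups["backup_tue"]
print("unique blocks stored:", len(store))   # 4 instead of 6
```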

Using topic maps to reduce duplicate data in backups could strengthen encryption by reducing the footprint of the encrypted data. The less an encryption key is used, the less ciphertext there is for analysis.

Those are nice ways to introduce the advantages that topic maps can bring to information-driven enterprises.

Questions:

  1. Search for data reduction and choose one of the products to review.
  2. Would the suggested use of topic maps to de-dupe backups work for that product? Why/why not? (3-5 pages, citations)
  3. How would you use topic maps to provide a better (in your opinion) interface to standard backups? (3-5 pages, no citations)

PS: Yes, I know, I didn’t define standard backups in #3. Surprise me. Grab some free or demo backup software and report back with screen shots.
