Archive for February, 2011

A Topic Map Interface

Monday, February 28th, 2011

As promised, a mockup of a topic map interface.

Note that I did not promise a generic topic map interface although this comes pretty close to being generic.

Oh, we need an example of the interface:

Job, Chapter 1

1: There was a man in the land of Uz, whose name was Job; and that man was perfect and upright, and one that feared God, and eschewed evil.
2: And there were born unto him seven sons and three daughters.
3: His substance also was seven thousand sheep, and three thousand camels, and five hundred yoke of oxen, and five hundred she asses, and a very great household; so that this man was the greatest of all the men of the east.
4: And his sons went and feasted in their houses, every one his day; and sent and called for their three sisters to eat and to drink with them.
5: And it was so, when the days of their feasting were gone about, that Job sent and sanctified them, and rose up early in the morning, and offered burnt offerings according to the number of them all: for Job said, It may be that my sons have sinned, and cursed God in their hearts. Thus did Job continually.
6: Now there was a day when the sons of God came to present themselves before the LORD, and Satan came also among them.
7: And the LORD said unto Satan, Whence comest thou? Then Satan answered the LORD, and said, From going to and fro in the earth, and from walking up and down in it.

You should note that each of the verses has a subjectIdentifier prepended to it, enabling the reader to quickly locate that verse in a collection of verses.

That identifier also enables the display of other translations of a verse alongside the verse in question.
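A minimal sketch of what I mean, in Python. The identifier scheme and texts are illustrative only; the point is that one subjectIdentifier keys every translation of a verse, so a display can pull alternates on demand.

```python
# Hypothetical store: each verse's subjectIdentifier ("Job 1:1") keys
# all the translations of that verse.
verses = {
    "Job 1:1": {
        "KJV": "There was a man in the land of Uz, whose name was Job; ...",
        "Hebrew": "אִישׁ הָיָה בְאֶרֶץ־עוּץ ...",
    },
}

def parallel_display(identifier, translations):
    """Return the requested translations of a verse, side by side."""
    verse = verses[identifier]
    return [verse[t] for t in translations if t in verse]

print(parallel_display("Job 1:1", ["KJV", "Hebrew"]))
```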

Were this a display of the accepted Hebrew text, of which these verses are a translation, the displayed Hebrew text could act as a gateway to morphological and syntactic annotations (yes, there are differing syntactic parsings of the Hebrew text), as well as links to the latest research, on a verse, word or structural-element basis.

That is what I meant when I said a pull interface.

A pull interface is one where the user, and not a programmer, gets to decide what information they wish to see.

For example, say I found the time to practice my Hebrew more than I have for the last 5 or 6 years. When I mouse over a Hebrew text, I don’t want a word definition to be displayed but simply its morphological parsing, as a hint to me to try to work out the meaning of the text from context.
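In code, a pull interface might look something like this sketch (the data and names are hypothetical): each word carries several kinds of annotation, and the user’s stored preference, not the programmer, decides which one a mouse-over displays.

```python
# Hypothetical annotation store: each word offers several annotation types.
annotations = {
    "ish": {  # the first word of Job 1:1, transliterated
        "definition": "man",
        "morphology": "noun, masculine singular absolute",
    },
}

def on_mouse_over(word, user_prefs):
    """Return only the annotation the user asked to see (pull, not push)."""
    return annotations[word][user_prefs["hover"]]

# A reader practicing Hebrew asks for parsing only, not definitions:
print(on_mouse_over("ish", {"hover": "morphology"}))
```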

Contrast that with push models that foist information onto me whether I would view it or not, because the developer “knows” what most people want, no doubt by use of Urim and Thummim.

Why not empower users to choose the display (or not) of additional information?

In this particular case, I may choose:

  1. The classic King James translation.
  2. A modern translation.
  3. Several translations in parallel.
  4. The standard Hebrew text.
  5. Morphological or syntactic annotations to the Hebrew text.
  6. Literature annotations to either English/Hebrew text.
  7. Maps or archaeological supplements to the text.

All underlying the text as interface and subject to expansion by a topic map.

When it comes to developers versus users, that long-time topic map advocate, Humpty Dumpty, would say:

“The question is,” said Humpty Dumpty, “which is to be master -- that’s all.”

I vote for users. How about you?

PS: BTW, my mockup does all the things I outlined. It doesn’t use JavaScript, Ajax or jQuery to do it, but it had those capabilities long before such mechanical assistants appeared on the scene.

What topic maps can add to this interface is a convenience factor and enabling others to more easily bring additional material to my attention, should I choose to view it.

How you wish to enable that use of topic maps is a detail. An important detail but one that should not be confused with or elevated to the same level as successful delivery of content chosen by the user.

YouTube Topic Map?

Monday, February 28th, 2011

Is anyone working or thinking about working on a topic map for YouTube?

I ask because while I can eventually find search terms that will narrow the videos down to a set of lectures, they are disorderly and have duplicates.

If someone is working on a project that would include CS lectures and similar offerings, I would be willing to contribute some editing/sorting of data.

Probably not the most popular subject for a community based topic map. 😉

I might be willing to contribute some editing/sorting of data for more popular topic maps as well. Depends on the topic. (sorry!)

Suggestions (with a link to a representative YouTube video) welcome!

You can even conceal your identity! I won’t out you for liking the sneezing panda video.

RDBMS in the Social Networks Age

Monday, February 28th, 2011

RDBMS in the Social Networks Age by Lorenzo Alberton.

A slide deck that made me wish I had seen the presentation!

Its treatment of graph representation in a relational system is particularly strong.

The bibliography is useful as well.

Just to tempt you into viewing the slide deck, slide 19, The Boring Stuff, is very amusing.

From Search to Discovery

Monday, February 28th, 2011

From Search to Discovery by Tony Russell-Rose.


The landscape of the search industry is undergoing fundamental change. In particular, there is a growing realisation that the true value of search is best realised by embedding it in a wider discovery context, so that in addition to facilitating basic lookup tasks such as known-item search and fact retrieval, support is also provided for more complex exploratory tasks such as comparison, aggregation, analysis, synthesis, evaluation, and so on. Clearly, for these sorts of activity a much richer kind of interaction or dialogue between system and end user is required. This talk examines what forms this interactivity might take and discusses a number of principles and approaches for designing effective search and discovery experiences.

Topic map projects looking to develop successful interfaces would do well to heed this presentation.

TeXBlog – Typography with TeX and LaTeX

Monday, February 28th, 2011

TeXBlog – Typography with TeX and LaTeX

I mention this blog for several reasons.

TeX and LaTeX would benefit from the production of a topic map that eased users from less capable systems to a more full featured publication system.

To that extent, this blog would be an excellent starting place for gathering resources for such an effort.

Most of the major academic houses require the use of TeX or LaTeX for publications so if you want to publish about topic maps, that knowledge is a presumed starting point.

Knowledge of TeX and LaTeX will give you an example of how a well designed system can prosper and grow over time. Something to aspire to.

Why Topic Maps? (or schema version n.0)

Monday, February 28th, 2011

A friend of mine forwarded one of the nay-sayer screeds bashing NoSQL implementations in favor of one of the current SQL offerings.

Why that sort of thing is popular remains a mystery to me. I freely grant that some of the NoSQL efforts may be unsuccessful but the effort overall is an interesting one.

And not unlike topic maps when you think about it.

In order to do normalization for a relational database, you have to both know all the subjects you are going to talk about in the database in advance and, more importantly, know how you are going to identify them.

(Yes, there are other, deeper semantic issues with relational databases but this post is for newbies so I won’t cover those here.)

So, what happens if you don’t know all the subjects in advance, how you are going to identify them or even what you want to say about them?

Well, with a relational database, I suppose that is what you call version 2.0 of your database schema, to which you migrate all your data.

And the same is true for relationships between subjects in your database.

Should you decide to add tables for those relationships, well, now you are at version 3.0 of your database schema.
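For the newbies, the churn looks something like this sketch using Python’s built-in sqlite3 (the table and column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Version 1.0: only the subjects we knew about in advance.
db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

# A new property turns up -> version 2.0, plus a data migration.
db.execute("ALTER TABLE person ADD COLUMN email TEXT")

# A new relationship between subjects -> version 3.0: a whole new table.
db.execute("""CREATE TABLE employment (
    person_id INTEGER REFERENCES person(id),
    employer TEXT)""")

tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['employment', 'person']
```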

Database versioning, or “evolution” as I think it is sometimes called, is an entire area of research and software in the database world. I really need to pull some of that together for a piece on how topic maps can help with the documentation aspects of that process.

I started to say that illustrates an advantage of topic maps over relational databases, that the schema does not have to be altered to add new relationships.

And from a certain point of view, it certainly is an advantage.

But, assume we do add a relationship type to a topic map, how do we then version the topic map?

Should we create topics that exist in associations with other topics to add versioning information as part of associations?

Or are there other mechanisms we should consider?

Sorry, did not mean to get side tracked into versioning but it is something that production quality topic maps should take into account.

I don’t think the choices are nearly as stark as SQL vs. NoSQL vs. Topic Maps vs. whatever.

Most information systems have needs that can be met only with a healthy mixture of solutions.

People who advocate “myStack” solutions are selling just that: “myStack” solutions.

As a user/consumer, I prefer “mySolution” stacks. Not exactly the same thing.

Machine Learning – Andrew Ng – YouTube

Monday, February 28th, 2011

The lectures by Andrew Ng that I pointed to in Machine Learning Lectures (Video) on iTunes are also available on YouTube, in no particular order. I have created an ordered listing of the lectures on YouTube below.

What would be even more useful would be a very short summary/topic listing for each lecture so that additional information could be linked in, dare I say topic mapped?, to create a more useful resource.

No promises but as time permits or as readers contribute, something like that is definitely within the range of possibilities.

Machine Learning – Andrew Ng – Stanford

R Fundamentals and Programming Techniques

Monday, February 28th, 2011

R Fundamentals and Programming Techniques

Thomas Lumley on R.

One of the strengths and weaknesses of the topic map standardization effort was that it presumed you already had a topic map.

A strength because the methods for arriving at a topic map remain unbounded and unsullied by choices (and limitations) of languages, approaches, etc.

A weakness because the topic map novice is left in the position of a tourist who marvels at a medieval cathedral but has no idea how to build one themselves. (Well, ok, perhaps that is a bit of a stretch. 😉 )

The fact remains that there are ever-increasing amounts of data becoming available, much of it just crying out for topic maps to be built for its navigation.

R is one of the currently popular data mining languages that can be pressed into service for the exploration of data and construction of topic maps.

Definitely a resource to explore and exploit before you invest in any of the printed R reference materials.

Working with Trees in the Phyloinformatic Age

Monday, February 28th, 2011

Working with Trees in the Phyloinformatic Age by William H. Piel discusses the processing and display of phylogenetic trees.

The PhyloWidget that is mentioned in the slides can be found at: Phylowidget

If you are working on topic maps in bioinformatics, this should be on your list of works to review.

I was particularly amused by the advice given at the Phylowidget site in its step-by-step instructions:

* If you are a biologist: we recommend starting with the First Step and moving forwards.
* If you are a developer: we recommend starting with the Last Step and moving backwards.

Book of Proof

Monday, February 28th, 2011

Book of Proof by Richard Hammack.

Important for topic maps research and also captures an important distinction that is sometimes overlooked in topic maps.

From the Introduction:

This is a book about how to prove theorems.

Until this point in your education, you may have regarded mathematics as being a primarily computational discipline. You have learned to solve equations, compute derivatives and integrals, multiply matrices and find determinants; and you have seen how these things can answer practical questions about the real world. In this setting, your primary goal in using mathematics has been to compute answers.

But there is another approach to mathematics that is more theoretical than computational. In this approach, the primary goal is to understand mathematical structures, to prove mathematical statements, and even to discover new mathematical theorems and theories. The mathematical techniques and procedures that you have learned and used up until now have their origins in this theoretical side of mathematics. For example, in computing the area under a curve, you use the Fundamental Theorem of Calculus. It is because this theorem is true that your answer is correct. However, in your calculus class you were probably far more concerned with how that theorem could be applied than in understanding why it is true. But how do we know it is true? How can we convince ourselves or others of its validity? Questions of this nature belong to the theoretical realm of mathematics. This book is an introduction to that realm.

This book will initiate you into an esoteric world. You will learn to understand and apply the methods of thought that mathematicians use to verify theorems, explore mathematical truth and create new mathematical theories. This will prepare you for advanced mathematics courses, for you will be better able to understand proofs, write your own proofs and think critically and inquisitively about mathematics.

Quite legitimately there are topic map activities that are concerned with the efficient application and processing of particular ways to identify subjects and to determine when subject sameness has occurred.

It is equally legitimate to investigate how subject identity is viewed in different domains and the nature of data structures that can best represent those views.

Either one without the other is incomplete.

For those walking on the theoretical side of the street, I think this volume will prove to be quite valuable.

The R Inferno

Sunday, February 27th, 2011

The R Inferno

The imitation of the Inferno is best taken as an amusing conceit, something to add to the amusement of the prose.

Just as an example, the second circle is populated in Dante with the carnal, not gluttons. Gluttons await us in the third circle. (At least according to Dante. I haven’t seen the video game.)

So long as it isn’t taken as your guide through Hell, no harm done I suppose.

Let me know about your experiences with it as a guide to R.

You and Your Research

Sunday, February 27th, 2011

You and Your Research

Richard Hamming (yes, that Hamming)


The title of my talk is, “You and Your Research.” It is not about managing research, it is about how you individually do your research. I could give a talk on the other subject– but it’s not, it’s about you. I’m not talking about ordinary run-of-the-mill research; I’m talking about great research. And for the sake of describing great research I’ll occasionally say Nobel-Prize type of work. It doesn’t have to gain the Nobel Prize, but I mean those kinds of things which we perceive are significant things. Relativity, if you want, Shannon’s information theory, any number of outstanding theories– that’s the kind of thing I’m talking about.

I happened upon this quite by accident.

It may be known by every person involved in topic maps, semantic web, ontology and similar work. Or not.

In any event, I think it is worth pointing out, even as repetition for some of you.

I think it has a great deal of relevance both for the development of topic map software as well as topic maps per se.

When You Hear Hoofbeats… – Post

Saturday, February 26th, 2011

When You Hear Hoofbeats… Bob Carpenter is learning C++ and has no difficulty getting error messages, but is having difficulty discovering the causes of those error messages.

Bob describes his need as one of proceeding from a known error message, to one or more possible causes. A “reverse index” in his words.

I suspect that would be a very useful thing to have for any number of programming languages, not to mention shells, utilities and the like.

It occurs to me that topic maps, via associations, could be the very mechanism that Bob is looking for.

At least in terms of the mechanics, filling it with content would be another matter entirely.
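A rough sketch of what such a reverse index might look like as a topic-map-like structure, with hypothetical error strings: each association links one error message to a candidate cause, so lookup proceeds from symptom to diagnosis.

```python
from collections import defaultdict

# Map each error message to the causes contributors have associated with it.
reverse_index = defaultdict(list)

def record_association(error_message, cause):
    """One association: error message (role: symptom) -> cause (role: diagnosis)."""
    reverse_index[error_message].append(cause)

# Hypothetical harvested content:
record_association("undefined reference to `foo'",
                   "declaration without a definition")
record_association("undefined reference to `foo'",
                   "object file missing from the link line")

print(reverse_index["undefined reference to `foo'"])
```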

Thoughts on how to structure such a topic map?

Or on how to best organize a project to populate it?

Perhaps harvesting “error messages” from posts/blogs/etc. with pointers back to the same and offering the opportunity to specify the possible cause?

With some recognition mechanism for those whose suggested causes are most often confirmed by other contributors?

It has the issues commonly found in topic map projects:

  1. Data harvesting
  2. Interfaces
  3. Management of the map
  4. Subject Identity
  5. etc.

Thoughts? Suggestions?

Baseball Stats vs. riak map/reduce

Saturday, February 26th, 2011

Baseball Stats vs. riak map/reduce

I saw this at Alex Popescu’s myNoSQL site, with the name: MapReducing Big Data with Riak and Luwak but since the baseball season is already in the news in Atlanta, I thought the other title works better.

Bryan Fink makes effective use of 30 minutes and baseball stats from RETROSHEET to demonstrate the use of riak map/reduce and how it might be applied to other data sets.

Well worth the time.

Experiencing Information – Blog

Saturday, February 26th, 2011

Experiencing Information is a blog by James Kalbach.

Kalbach authored Designing Web Navigation and a number of other publications on information delivery.

I will be mentioning posts by Kalbach that seem to me to be particularly useful for topic map interfaces but commend the blog to you in general.

…a grain of salt

Friday, February 25th, 2011

Benjamin Bock asked me recently about how I would model a mole of salt in a topic map.

That is a good question but I think we had better start with a single grain of salt and then work our way up from there.

At first blush, and only at first blush, many subjects look quite easy to represent in a topic map.

A grain of salt looks simple at first glance: just create a PSI (Published Subject Identifier), put that as the subjectIdentifier on a topic and be done with it.

Well…, except that I don’t want to talk about a particular grain of salt, I want to talk about salt more generally.

OK, one of those, I see.

Alright, same answer as before, except make the PSI for salt in general, not some particular grain of salt.

Well,…., except that when I go to the Wikipedia article on salt, Salt, I find that salt is a compound of chlorine and sodium.

A compound, oh, that means something made up of more than one subject. In a particular type of relationship.

Sounds like an association to me.

Of a particular type, an ionic association. (I looked it up, see: Ionic Compound)

And this association between chlorine and sodium has several properties reported by Wikipedia, here are just a few of them:

  • Molar mass: 58.443 g/mol
  • Appearance: Colorless/white crystalline solid
  • Odor: Odorless
  • Density: 2.165 g/cm3
  • Melting point: 801 °C, 1074 K, 1474 °F
  • Boiling point: 1413 °C, 1686 K, 2575 °F
  • … and several others.

    If you are interested in scientific/technical work, please be aware of CAS, a work product of the American Chemical Society, with a very impressive range of unique identifiers. (56 million organic and inorganic substances, 62 million sequences and they have a counter that increments while you are on the page.)

    Note that unlike my suggestion, CAS takes the assign-a-unique-identifier view for the substances, sequences and chemicals that they curate.

    Oh, sorry, got interested in the CAS as a source for subject identification. In fact, that is a nice segue to consider how to represent the millions and millions of compounds.

    We could create associations with the various components being role players but then we would have to reify those associations in order to hang additional properties off of them. Well, technically speaking in XTM we would create non-occurrence occurrences and type those to hold the additional properties.

    Sorry, I was presuming the decision to represent compounds as associations. Shout out when I start to presume that sort of thing. 😉

    The reason I would represent compounds as associations is that the components of the associations are then subjects I can talk about and even add additional properties to, or create mappings between.
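To make that concrete, here is a sketch (not standard XTM, just an illustration) of salt as an association whose role players are subjects in their own right, with the compound’s properties hanging off the reified association:

```python
from dataclasses import dataclass, field

@dataclass
class Association:
    """A typed association with named role players and its own properties."""
    assoc_type: str
    roles: dict                               # role -> topic (a subject)
    properties: dict = field(default_factory=dict)

salt = Association(
    assoc_type="ionic compound",
    roles={"cation": "sodium", "anion": "chlorine"},
    properties={"molar mass": "58.443 g/mol", "melting point": "801 °C"},
)

# The role players are themselves subjects we can say more about or map to:
print(salt.roles["cation"], salt.properties["molar mass"])
```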

    I suspect that CAS has chemistry from the 1800’s fairly well covered but what about older texts? Substances before then may not be of interest to commercial chemists but certainly would be of interest to historians and other scholars.

    Use of a topic map plus the CAS identifiers would enable scholars studying older materials to effectively share information about older texts, which have different designations for substances than CAS would record.

    You could argue that I could use a topic for compounds, much as CAS does, and rely upon searching in order to discover relationships.

    Tis true, tis true, but my modeling preference is for relationships seen as subjects, although I must confess I would prefer a next generation syntax that avoids the reification overhead of XTM.

    Given the prevalence of complex relationships/associations, as you see from the CAS index, I think a simplification of the representation of associations is warranted.

    Sorry, I never did quite reach Benjamin’s question about a mole of salt but I will take up that gauntlet again tomorrow.

    We will see that measurement (which figured into his questions about recipes as well) is an interesting area of topic map design.

    PS: Comments and/or suggestions on areas to post about are most welcome. Subject analysis for topic maps is not unlike cataloging in library science to a degree, except that what classification you assign is entirely the work product of your experience, reading and analysis. There are no fixed answers, only the ones that you find the most useful.

    scikits.learn machine learning in Python

    Friday, February 25th, 2011

    scikits.learn machine learning in Python

    From the website:

    Easy-to-use and general-purpose machine learning in Python

    scikits.learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib).

    It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.
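Not the scikits.learn API, but as a plain-Python sketch of the kind of algorithm the library packages, here is a toy one-dimensional k-means (points and starting centers are made up):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign points to nearest center, recompute centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # New center = mean of its group (keep old center if group is empty).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
print(centers)  # converges to [1.0, 9.5]
```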

    This could be a good model for a “learning topic maps” site for people interested in the technical side of topic maps.

    There may not be a real call for training people who aren’t interested in learning the technical side of topic maps.

    By analogy with indexing, lots of folks can use indexes (sorta, ok, I am being generous) but not that many folks can create good indexes.

    I will be posting some examples of wannabe indexes next week.

    Producing Open Source Software

    Friday, February 25th, 2011

    Producing Open Source Software

    I can’t imagine why my digital page turning should have leapt to “Handling Difficult People,” but it did. 😉

    Actually just skimming the TOC, this looks like a good book for any open source project.

    My question to you, once you have had a chance to read it, could the title also be:

    Producing Open Source Topic Maps?

    Why/Why Not?

    Seems to me that the topic maps community could be more collaborative than it is.

    I am sure others feel the same way, so why doesn’t it happen more often?

    Twitter Social Graph – Post

    Friday, February 25th, 2011

    Twitter Social Graph

    Timothy M. Kunau covers a couple of Twitter social graph programs.

    I am interested to know what you think of the “zoom” feature?

    It is often touted as giving the “high level” view but I wonder how often that is in fact a useful view?

    Has anyone studied that under controlled circumstances or is it a matter of anecdote and lore?


    JavaScript InfoVis Toolkit

    Friday, February 25th, 2011

    JavaScript InfoVis Toolkit

    From the website:

    The JavaScript InfoVis Toolkit provides tools for creating Interactive Data Visualizations for the Web.

    Have to wonder if “interactive data visualizations” is one step towards shared exploration/mining of data sets for the construction of topic maps?

    That is, how long will it be before we are interacting with each other in the visualizations?


    Learn You Some Erlang For Great Good!

    Friday, February 25th, 2011

    Learn You Some Erlang For Great Good!

    I mention this site as a preface to asking if anyone is working on a topic map implementation in Erlang?

    I suppose I have been influenced by Barta’s arguments for a functional approach.

    In part because what subjects you see and what properties they possess are determined by what you ask to see.

    Not that other approaches have a different answer, they just aren’t as transparent about it.

    It is perfectly fine to have a point of view on the world, just don’t confuse your point of view with the world.

    PS: The message passing capabilities should also be of interest for distributed as well as non-distributed topic map applications. Imagine a proxy that broadcasts the identity of the subject it represents.
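The broadcast idea, sketched with Python’s standard library rather than Erlang processes (the identifier is made up): a proxy announces the identity of the subject it represents to whoever is listening on a queue.

```python
import queue
import threading

# A shared message bus standing in for Erlang-style message passing.
bus = queue.Queue()

def proxy(subject_identifier):
    """A proxy that broadcasts the identity of its subject."""
    bus.put(("i-represent", subject_identifier))

t = threading.Thread(target=proxy, args=("http://example.org/psi/salt",))
t.start()
t.join()

print(bus.get())  # ('i-represent', 'http://example.org/psi/salt')
```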

    RHIPE: An Interface Between Hadoop and R for Large and Complex Data Analysis

    Friday, February 25th, 2011

    RHIPE: An Interface Between Hadoop and R for Large and Complex Data Analysis

    Enables processing with R across data sets too large to load.

    But, you have to see the video to watch the retrieval from 14 GB of data that had been produced using RHIPE. Or the 145 GB of SSH traffic from the Department of Homeland Security.

    Very impressive.

    Machine Learning Ex2 – Linear Regression – Post

    Friday, February 25th, 2011

    Machine Learning Ex2 – Linear Regression

    A useful exercise on linear regression using R.

    Ender’s Topic Map

    Thursday, February 24th, 2011

    Warning: Spoiler for Ender’s Game by Orson Scott Card.*

    After posting my comments on the Maiana interface, in my posting Maiana February Release, I fully intended to post a suggested alternative interface.

    But, comparing end results to end results isn’t going to get us much further than: “I like mine better than yours,” sort of reasoning.

    It has been my experience in the topic maps community that such reasoning isn’t terribly helpful or productive.

    I want to use Ender’s Game to explore criteria for a successful topic map interface.

    I think discussing principles of interfaces, which could be expressed any number of ways, is a useful step before simply dashing off interfaces.

    Have all the children or anyone who needs to read Ender’s Game left at this point?


    I certainly wasn’t a child or even young adult when I first read Ender’s Game but it was a deeply impressive piece of work.

    Last warning: Spoiler immediately follows!

    As you may have surmised by this point, the main character in the story is named Ender. No real surprise there.

    The plot line is a familiar one, Earth is threatened by evil aliens (are there any other kind?) and is fighting a desperate war to avoid utter destruction.

    Ender is selected for training at Battle School as are a number of other, very bright children. A succession of extreme situations follow, all of which Ender eventually wins, due in part to his tactical genius.

    What is unknown to the reader and to Ender until after the final battle is that Ender’s skills and tactics have been simultaneously used in real space battles.

    Ender has been used to exterminate the alien race.

    That’s what I would call a successful interface on a number of levels.

    Ender’s environment wasn’t designed (at least from his view) as an actual war command center.

    That is to say that it didn’t have gauges, switches, tactical displays, etc. Or at least the same information was being given to Ender, in analogous forms.

    Forms that a child could understand.

    First principle for topic map interfaces: Information must be delivered in a form the user will understand.

    You or I may be comfortable with all the topic map machinery talk-talk but I suspect that most users aren’t.

    Here’s a test of that suspicion. Go up to anyone outside of your IT department and ask them to explain how Facebook works. Just in general terms, not the details. I’ll wait. 😉

    OK, now are you satisfied that most users aren’t likely to be comfortable with topic map machinery talk-talk?

    Second principle for topic map interfaces: Do not present information to all users the same way.

    The military types and Ender were presented the same information in completely different ways.

    Now, you may object that is just a story but I suggest that you turn on the evening news and listen to 30 minutes of Fox News and then 30 minutes of National Public Radio (A US specific example but I don’t know the nut case media in Europe.).

    Same stories, one assumes the same basic facts, but you would think one or both of them had overheard an emu speaking in whispered Urdu in a crowded bus terminal.

    It isn’t enough to simply avoid topic map lingo; a successful topic map interface will be designed for particular user communities.

    In that regard, I think we have been misled by the success, or at least non-failure, of interfaces for word processors, spreadsheets, etc.

    The range of those applications is so limited and the utility of them for narrow purposes is so great, that they have succeeded in spite of their poor design.

    So, at this point I have two principles for topic map interface design:

    • Information must be delivered in a form the user will understand.
    • Do not present information to all users the same way.

    I know, Benjamin Bock, among others, is going to say this is all too theoretical, blah, blah.

    Well, it is theoretical but then so is math but banking, which is fairly “practical,” would break down without math.


    Actually I have an idea for an interface design that at least touches on these two principles for a topic map interface.

    Set your watches for 12:00 (Eastern Time US) 28 February 2011 for a mockup of such an interface.

    *(Wikipedia has a somewhat longer summary, Ender’s Game.)

    PS: More posts on principles of topic map interfaces to follow. Along with more mockups, etc. of interfaces.

    How useful any of the mockups prove to be, I leave to your judgment.

    Machine Learning for .Net

    Thursday, February 24th, 2011

    Machine Learning for .Net

    From the webpage:

    This library is designed to assist in the use of common Machine Learning Algorithms in conjunction with the .NET platform. It is designed to include the most popular supervised and unsupervised learning algorithms while minimizing the friction involved with creating the predictive models.

    Supervised Learning

    Supervised learning is an approach in machine learning where the system is provided with labeled examples of a problem and the computer creates a model to predict future unlabeled examples. These classifiers are further divided into the following sets:

    • Binary Classification – Predicting a Yes/No type value
    • Multi-Class Classification – Predicting a value from a finite set (i.e. {A, B, C, D } or {1, 2, 3, 4})
    • Regression – Predicting a continuous value (i.e. a number)

    Unsupervised Learning

    Unsupervised learning is an approach which involves learning about the shape of unlabeled data. This library currently contains:

    1. KMeans – Performs automatic grouping of data into K groups (specified a priori)

      Labeling data is the same as for the supervised learning algorithms with the exception that these algorithms ignore the [Label] attribute:

      var kmeans = new KMeans();
      var grouping = kmeans.Generate(ListOfStudents, 2);

      Here the KMeans algorithm groups the ListOfStudents into two groups, returning an array with the appropriate group for each student (in this case group 0 or group 1).

    2. Hierarchical Clustering – In progress!
    3. Planning

      Currently planning/hoping to do the following:

      1. Boosting/Bagging
      2. Hierarchical Clustering
      3. Naïve Bayes Classifier
      4. Collaborative filtering algorithms (suggest a product, friend etc.)
      5. Latent Semantic Analysis (for better searching of text etc.)
      6. Support Vector Machines (more powerful classifier)
      7. Principal Component Analysis – Aids in dimensionality reduction which should allow/facilitate learning from images
      8. *Maybe* – Common AI algorithms such as A*, Beam Search, Minimax etc.

    So, if you are working in a .Net context, here is a chance to get in on the ground floor of a machine learning project.
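For readers curious what a call like KMeans.Generate(ListOfStudents, 2) might be doing under the hood, here is a rough pure-Python sketch of Lloyd's algorithm on a single numeric feature. The exam scores are invented for illustration, and the real library's internals may differ:

```python
# A rough sketch of k-means clustering (Lloyd's algorithm) on one numeric
# feature, returning a group index per value -- analogous in spirit to
# KMeans.Generate(ListOfStudents, 2) above.

def kmeans_1d(values, k=2, iterations=10):
    """Group 1-D values into k clusters (k >= 2); return a group index per value."""
    # Deterministic start: spread initial centroids across the data range.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        assignments = [min(range(k), key=lambda c: abs(v - centroids[c]))
                       for v in values]
        # Move each centroid to the mean of its assigned values.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignments

exam_scores = [48, 52, 45, 91, 88, 95]   # hypothetical student feature
print(kmeans_1d(exam_scores, 2))
```

As in the library's example, the result is an array of group numbers (0 or 1), one per input value.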

    Cassandra’s data model as records and lists – Post

    Thursday, February 24th, 2011

    Cassandra’s data model as records and lists

    From the post:

    I have to admit I’ve never really been happy with Cassandra’s data model, or to be more precise, I’ve never really been happy with my understanding of the model. However, I’ve realized that if we think of two use cases for column families then things may become a bit clearer. For me, column families can be used in one of two ways: either as a record or as an ordered list.

    I thought it was helpful, what do you think?
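The record vs. ordered-list distinction can be sketched in plain Python. The row keys, column names, and values below are made up for illustration; a real Cassandra column family also keeps a timestamp per column:

```python
# A loose sketch of the post's two views of a Cassandra column family
# (row key -> columns, with columns stored sorted by column name).

# 1. As a record: column names are fixed field names, values are the data.
users = {
    "jsmith": {"email": "jsmith@example.com", "city": "Oslo"},
}

# 2. As an ordered list: the column names themselves carry the data
#    (here, timestamps), so a row reads as a timeline.
timeline = {
    "jsmith": {
        "2011-02-20T10:00": "post-17",
        "2011-02-22T14:30": "post-18",
        "2011-02-24T09:15": "post-19",
    },
}

# Because Cassandra keeps columns sorted by name, slicing a row yields
# the events in order:
events = [timeline["jsmith"][t] for t in sorted(timeline["jsmith"])]
print(events)
```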

    Challenge to the Opera Topic Map?

    Thursday, February 24th, 2011

    Well, not quite. It needs a topic mapping step but…, you are a little closer than before.

    Data mining & Hip Hop reports:

    Tahir Hemphil data mined 30 years of hip-hop lyrics to provide a searchable index of the genre’s lexicon.

    The project analyzes the lyrics of over 40,000 songs for metaphors, similes, cultural references, phrases, memes and socio-political ideas. [Project] The project is one of its kind with a huge potential offering to the hip hop world: not only can you visualize the artists’ careers but also gain deeper analysis into their world, where you can potentially patternize their music.

    See the post for more material and links.

    ICWSM 2011 Data Challenge

    Thursday, February 24th, 2011

    ICWSM 2011 Data Challenge

    From the website:

    The ICWSM 2011 Data Challenge introduces a brand-new dataset, the 2011 ICWSM Spinn3r dataset. This dataset includes blogs from Spinn3r over a 33 day period, from January 13th, 2011 through February 14th, 2011. See here for details on how to obtain the collection.

    Since the new collection spans some rather extraordinary world events, this year introduces a specific task: to locate significant posts in the collection which are relevant to the revolutions in Tunisia and Egypt. The criterion for “significant relevance” is that the post is worthy of being shared by you, an observer, with a friend. To participate in the task, we will ask that you submit a ranked list of items in the collection, and we will do some form of relevance judgments and scoring in time for the conference.

    The data challenge will culminate at ICWSM 2011 with a special workshop. To participate in the workshop, you must submit a 3-page short paper in PDF format and bring a poster to present at the workshop. The short papers will not be reviewed, but the workshop organizers will select a small panel of speakers based on the submissions. The short paper/poster can describe your participation in the shared task, OR ALTERNATIVELY other compelling work you have performed WITH THE 2011 DATASET.

    Submissions will be due on April 22, 2011. Details on the submission process will be posted soon.

    Oh, just briefly about the collection:

    The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see for a more detailed list of events spanning the dataset’s time period).

    If you are going to be in Barcelona (the conference location), why not submit an entry using topic maps?

    Maiana February Release

    Wednesday, February 23rd, 2011

    Maiana February Release

    Uta Schulze and Michael Prilop report:

    Today, the Maiana team released the first 2011 version of Maiana. After forgoing the January release because of open issues we are now presenting Maiana in a partly new design – including pagination and catchy boxes. Most important: Maiana switched to TMQL4J v. 3.1.0-SNAPSHOT supporting the draft of 2008.

    • New design: Check out our fresh new design to present data in a topic map (e.g. the Opera Topic Map). Do you like it? We also extended the ‘Overview’ box to summarize every action available to the topic map. Last but not least, we added pagination to avoid long load time and vertical scrolling.
    • New TMQL4J version: We now run on TMQL4J v. 3.1.0-SNAPSHOT (see TMQL4J 3.0.0 Release News for more information). Currently, only queries compliant to the TMQL draft of 2008 are supported.
    • Donating queries: TMQL and SPARQL queries which until now could only be used privately may now be set public, enabling the sharing of queries or simply promoting an interesting query result. An overview of queries may be found on the user’s corresponding profile page.
    • Syntax Highlighting of TMQL Queries: And because reading queries is difficult as it is, we now use syntax highlighting when displaying queries, and even offer some autocompletion (try typing “FOR”).
    • “More about this subject”: To enhance the browsing experience we now look up additional information whilst visiting a topic page. This expands our use of the Semantic Search to also show topics available in visible maps on Maiana and even providing similarity information (Opera).
    • Maiana Search Provider: Do you like using your browsers search field? Try adding Maiana as a search provider and find more, faster!

    The new Maiana homepage,

    With Opera,

    Comments on the new interface?

    I think the color scheme makes it more readable, something I appreciate more and more the older I get. 😉

    Beyond that…, well, I have to confess the topic map navigation interface doesn’t do a lot for me.

    I think because it seems to me, personal opinion, to emphasize the machinery of the topic map at the expense of the content.

    Hmmm, think of it this way: what if you had the display of an ODF-based document and it listed:

      <h> elements


      3 Document Structure
      3.1 Document Representation
      3.1.1 General


      <text:p> elements

      <draw:object> elements


    It would still be readable (sorta) but not exactly an enjoyable experience.

    Let me leave you to think about the machinery-in-front approach versus what you would suggest as an alternative.

    I have a suggested approach that I will be posting tomorrow.

    (Hint: It is more of a pull than push information model for the interface. That may be what is bothering me about the default topic map interface, that it is a push model.)

    Berlin SPARQL Benchmark (BSBM)

    Wednesday, February 23rd, 2011

    Berlin SPARQL Benchmark (BSBM)

    From the website:

    The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open web settings. As SPARQL is taken up by the community there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores, Named Graph stores, systems that map relational databases into RDF, and SPARQL wrappers around other kinds of data sources.

    The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix illustrates the search and navigation pattern of a consumer looking for a product.


    02/22/2011: Results of the February 2011 BSBM V3 Experiment released, benchmarking Virtuoso, BigOWLIM, 4store, BigData and Jena TDB with 100 million and 200 million triples datasets within the Explore and Update use cases.

    Serious-sized benchmarking files.

    I do wonder how diverse the file content is compared to content in the “wild,” so to speak.