## Archive for May, 2011

### Balisage 2011 Call for Late-breaking News

Tuesday, May 31st, 2011

Balisage 2011 Call for Late-breaking News

The call for late-breaking news for Balisage has gone out!

Due 10 June 2011!

Submit a paper if:

• You have plane tickets for Montreal but are not (yet) attending a conference
• You are fleeing to Canada early to avoid the draft avoidance rush and need a cover story
• You want to go to that great Greek restaurant just down from the Europa
• You heard about St. Catherine’s but just don’t believe the reports.
• You are attending Balisage and haven’t submitted a paper, yet

Oh, I almost forgot, the conference organizers have some suggestions on entering the fierce competition for a late-breaking paper:

a) really late-breaking (it reports on something that happened in the last month or two) or

b) a well-developed paper, an extended paper proposal, or a very long abstract with references on a topic related to Markup and not already on the 2011 conference program.

Try Balisage. It is the best markup conference of the year. (full stop)

### Biomedical Annotation…Webinar 1 June 2011 – 10 AM PT (17:00 GMT)

Tuesday, May 31st, 2011

Biomedical Annotation by Humans and computers in a Keyword-driven world

From the website:

Abstract:

As part of our project with the NCBO we have been curating expression experiments housed in NCBI’s GEO data base and annotating a variety of rat-related records using the NCBO Annotator and more recently, mining data from the NCBO Resource Index. The annotation pipelines and curation tools that we have built have demonstrated some strengths and shortfalls of automated ontology annotation. Similarly our manual curation of these records highlights areas where human involvement could be improved to better address the fact that we are living in the Google era where findability is King.

Speaker Bio:

Simon Twigger currently splits his time between being an Assistant Professor in the Human and Molecular Genetics Center at the Medical College of Wisconsin in Milwaukee and exploring the iPhone and iPad as mobile platforms for education and interaction. At MCW he has been an investigator on the Rat Genome Database project for the past 10 years, he worked with the Gene Ontology project and has been active in the BioCuration community as co-organizer of the past three International BioCuration meetings. He is the former Director of Bioinformatics for the MCW Proteomics Center and was previously the Biomedical Informatics Key Function Director for the MCW Clinical & Translational Science Institute. He is a Semantic web enthusiast and is eagerly awaiting the rapture of Web 3.0 when all the data will be taken up into the Linked Data cloud and its true potential realized.

Annotation, useful annotation anyway, is based on recognition of the subject of annotation. Should prove to be an interesting presentation.

Notes from the webinar:

(My personal notes while viewing the webinar in real time. The webinar controls in all cases of conflict. Posted to interest others in viewing the stored version of the webinar.)

• Rat Genome Database: http://rgd.mcw.edu
• interesting questions that researchers ask
• Where to find answers, PubMed 20 million+ citations, almost 1 per minute
• search is the critical thing – in all interfaces
• “Being able to find information is of great importance to researchers.”
• NCBO Annotator www.bioontology.org/wiki/index.php/Annotator_Web_service
• records annotated – curated the raw annotations – manual effort needed to track it down
• rat strain synonyms has issues
• work flow description
• mouse gut maps to course (ex. of mapping issue)
• Linking annotations to data
• RatMine faceted-search + lucene text indexing, interesting widgets
• Driving “Biological” Problem Part 2 – 55.6% of researchers rarely use archival databases, 56.0% rarely use published literature
• 3rd International biocurator meeting Amos Bairoch – “trying to second guess what the authors really did and found.”
• post-publication effort to make content be found. different from academic model where publication simply floats along.
• illustration of where the annotation path fails and the consequences of that failure.
• very cool visualization of how annotations can be visualized and the value thereof
• put in keywords and don’t care about it being found (paper)
• NCBO Resource Index could be a “semantic warehouse” of connections
• websites: gminer.mcw.edu, github.com/mcwbbc/, bioportal.bioontology.org, simont -at- mcw.edu @simon_t

### Announcing Neo4j 1.4 M03 “Karuna Stol”

Tuesday, May 31st, 2011

Announcing Neo4j 1.4 M03 “Karuna Stol”

From the release post:

Today [Friday, May 27, 2011] marks our third milestone in the Neo4j 1.4 releases. We’ve spent the time since our last release listening to the community and adding to our APIs to help make working with the database even easier and more productive. Under the covers we’ve also built in some performance enhancements that we think you’ll appreciate. And our eye-candy, which you know as Webadmin, has also been extended and tweaked.

Quickly:

• self-loops are allowed
• interact (create/delete) indexes in WebAdmin
• caching relationships
• HA less confusing

### Semantic Web Dog Food (There’s a fly in my bowl.)

Monday, May 30th, 2011

Semantic Web Dog Food

From the website:

Welcome to the Semantic Web Conference Corpus – a.k.a. the Semantic Web Dog Food Corpus! Here you can browse and search information on papers that were presented, people who attended, and other things that have to do with the main conferences and workshops in the area of Semantic Web research.

• 2133 papers,
• 5020 people and
• 1273 organisations at
• 20 conferences and
• 132 workshops,

and a total of 126886 unique triples in our database!

The numbers looked low to me until I read in the FAQ:

This is not just a site for ISWC [International Semantic Web Conference] and ESWC [European Semantic Web Conference] though. We hope that, in time, other metadata sets relating to Semantic Web activity will be hosted here — additional bibliographic data, test sets, community ontologies and so on.

This illustrates a persistent problem of the Semantic Web. This site has one way to encode the semantics of these papers, people, conferences and workshops. Other sources of semantic data on these papers, people, conferences and workshops may well use other ways to encode those semantics. And every group has what it feels are compelling reasons for following its choices and not the choices of others. Assuming they are even aware of the choices of others. (Discovery being another problem but I won’t talk about that now.)

The previous semantic diversity of natural language is now represented by a semantic diversity of ontologies and URIs. Now our computers can more rapidly and reliably detect that we are using different vocabularies. The SW seems like a lot of work for such a result. Particularly since we continue to use diverse vocabularies and more diverse vocabularies continue to arise.

The SW solution, using OWL Full:

5.2.1 owl:sameAs

The built-in OWL property owl:sameAs links an individual to an individual. Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same “identity”.

For individuals such as “people” this notion is relatively easy to understand. For example, we could state that the following two URI references actually refer to the same person:

<rdf:Description rdf:about="#William_Jefferson_Clinton"> <owl:sameAs rdf:resource="#BillClinton"/> </rdf:Description>

The owl:sameAs statements are often used in defining mappings between ontologies. It is unrealistic to assume everybody will use the same name to refer to individuals. That would require some grand design, which is contrary to the spirit of the web.

In OWL Full, where a class can be treated as instances of (meta)classes, we can use the owl:sameAs construct to define class equality, thus indicating that two concepts have the same intensional meaning. An example:

<owl:Class rdf:ID="FootballTeam"> <owl:sameAs rdf:resource="http://sports.org/US#SoccerTeam"/> </owl:Class>

One could imagine this axiom to be part of a European sports ontology. The two classes are treated here as individuals, in this case as instances of the class owl:Class. This allows us to state that the class FootballTeam in some European sports ontology denotes the same concept as the class SoccerTeam in some American sports ontology. Note the difference with the statement:

<footballTeam owl:equivalentClass us:soccerTeam /> 

which states that the two classes have the same class extension, but are not (necessarily) the same concepts.

Anyone see a problem? Other than requiring the use of OWL Full?

The absence of any basis for “…denotes the same concept as….?” I can’t safely reuse this axiom because I don’t know on what basis its author made such a claim. The URIs may provide further information that may satisfy me the axiom is correct but that still leaves me in the dark as to why the author of the axiom thought it to be correct. Overly precise for football/soccer ontologies you say but what of drug interaction ontologies? Or ontologies that govern highly sensitive intelligence data?
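To make the complaint concrete, here is a minimal Python sketch of a mapping that carries its basis with it. Everything in it is my own illustrative assumption: the `Mapping` class, the `basis` field, the `eu:` prefix and the property names are hypothetical and are not part of OWL or of any published sports ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A claim that two identifiers denote the same subject, plus its grounds."""
    source: str
    target: str
    basis: tuple = ()        # properties the author relied on; empty = unstated
    author: str = "unknown"


def reusable(mapping, required):
    """Accept a mapping only if its stated basis covers what the reader requires."""
    return set(required) <= set(mapping.basis)


# The owl:sameAs style of claim: two identifiers and nothing else.
bare = Mapping("eu:FootballTeam", "http://sports.org/US#SoccerTeam")

# The same claim with hypothetical grounds attached.
grounded = Mapping(
    "eu:FootballTeam",
    "http://sports.org/US#SoccerTeam",
    basis=("governing-body-rules", "players-per-side"),
    author="hypothetical-editor",
)

print(reusable(bare, ["governing-body-rules"]))
print(reusable(grounded, ["governing-body-rules"]))
```

With the bare mapping a reader has nothing to check a requirement against. With the grounded one, a reader who cares about a particular property can at least see whether the author considered it.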

So we repeat semantic diversity, create maps to overcome the repeated semantic diversity and the maps we create have no explicit basis for the mappings they represent. Tell me again why this was a good idea?

### Databases For Machine Learning Experiments

Monday, May 30th, 2011

Databases For Machine Learning Experiments

From the website:

An experiment database is a database designed to store learning experiments in full detail, aimed at providing a convenient platform for the study of learning algorithms.

By submitting all details about the learning algorithms, datasets, experimental setup and results, experiments can be easily reproduced and reused in further studies.

By querying and mining the database, it allows easy, thorough analysis of learning algorithms while providing all information to correctly interpret the results.

To get a first idea, watch the video tutorial (updated!) of our explorer tool. Or start querying online by looking at some examples!

Video tutorials:

Experiment Database for Machine Learning Tutorial – SQL Querying

Experiment Database for Machine Learning Tutorial – Video Querying

Very interesting site. Wondering how something similar could be done to illustrate the use of topic maps?

Has anyone used this in connection with a class on machine learning?

### Licensing Open Data: A Practical Guide

Monday, May 30th, 2011

Licensing Open Data: A Practical Guide

I would take seriously its suggestion to seek legal counsel if you have any doubts about data you want to use. IP (intellectual property) in any country is a field unto itself and international IP is even more complicated. Self-help, despite all the raging debates about licensing terms and licenses by non-lawyers, is not recommended.

Should not be a problem so long as you are using IP of a client for that client. Is a problem when you start using data from a variety of sources, some of which may not appreciate your organization of the underlying data. Or the juxtaposition of their data with other data, which places them in an unflattering light.

### Social Data on the Web (SDoW2011)

Monday, May 30th, 2011

Social Data on the Web (SDoW2011)

Important Dates:

Submission deadline: Aug 15, 2011 (23:59 pm Hawaii time, GMT-10)
Notification of acceptance: Sep 05, 2011
Camera-ready paper submission: Sep 15, 2011
Workshop: Oct 23/24, 2011

From the website:

Aim and Scope

The 4th international workshop Social Data on the Web (SDoW2011) co-located with the 10th International Semantic Web Conference (ISWC2011) aims to bring together researchers, developers and practitioners involved in semantically-enhancing social media websites, as well as academics researching more formal aspect of these interactions between the Semantic Web and Social Web.

It is now widely agreed in the community that the Semantic Web and the Social Web can benefit from each other. On the one hand, the speed at which data is being created on the Social Web is growing at exponential rate. Recent statistics showed that about 100 million Tweets are created per day and that Facebook has now 500 million users. Yet, some issues still have to be tackled, such as how to efficiently make sense of all this data, how to ensure trust and privacy on the Social Web, how to interlink data from different systems, whether it is on the Web or in the enterprise, or more recently, how to link Social Network and sensor networks to enable Semantic Citizen Sensing.

Prior Proceedings:

SDoW2008

SDoW2009

SDoW2010

### Exploring NYT news and its authors

Sunday, May 29th, 2011

Exploring NYT news and its authors

To say this project/visualization is clever is an understatement!

Completely inadequate description but the interface constructs a mythic “single” reporter on any topic you choose from stories in the New York Times. The interface also gives you reporters who wrote stories on that topic. You can then find what “other” stories the mythic one reporter wrote, as well as compare the stories written by actual NYT reporters.

A project of the IBM Center for Social Software, see: NYTWrites: Exploring The New York Times Authorship.

### Pew Research raw survey data now available

Sunday, May 29th, 2011

Pew Research raw survey data now available

Actually, the data sets pointed to by FlowingData are part of Pew Internet (the Pew Internet & American Life Project).

For all Pew raw data sets, see: Pew Research Center The Databank

Data is available in the following formats:

1. Raw survey data file in both SPSS and comma-delimited (.csv) formats. To protect the privacy of respondents, telephone numbers, county of residence and zip code have been removed from all public data files.
2. Cross tabulation file of questions with basic demographics in Word format. Standard demographic categories include sex, race, age, household income, educational attainment, parental status and geographic location (i.e. urban/rural/suburban).
3. Survey instrument/questionnaire in Word format. The survey questionnaire provides question and response labels for the raw data file. It also includes all interviewer prompts and programming filters for outside researchers who would like to see how our questions are constructed or use our questions in their own surveys.
4. Topline data file in Word format that includes trend data to previous surveys in which we have asked each question, where applicable.

As far as I know, the use of topic maps with survey and other data to create “profiles” of particular communities remains unexplored. Such profiles may not predict the actions of any individual, but probabilistic predictions about members of a group may be close enough. Predicting the actions of any individual may be NP-hard, but it is also irrelevant for most purposes.

### Visualization contests around the corner

Saturday, May 28th, 2011

Visualization contests around the corner

Several visualization contests summarized at FlowingData.

The Hacking Education contest looks really interesting. This caught my eye:

Reinvent the classroom project discovery experience to provide more serendipity, personalization, or casual exploration. (Etsy has more than five different ways to browse through their inventory.)

but there were other equally interesting suggested projects. The serendipity, personalization, or casual exploration line reminded me of topic maps.

Please post something here if you decide to enter. I am sure others will be interested.

Saturday, May 28th, 2011

From the website:

In the realm of public domain software for record linkage and unduplication (aka. dedupe software), The Link King reigns supreme. The Link King has fashioned a powerful alliance between sophisticated probabilistic record linkage and deterministic record linkage protocols incorporating features unavailable in many proprietary record linkage programs. (detailed overview (pdf))

The Link King’s probabilistic record linkage protocol was adapted from the algorithm developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database Project. The deterministic record linkage protocols were developed at Washington State’s Division of Alcohol and Substance Abuse for use in a variety of evaluation and research projects.

Looks very interesting but requires an SAS “base license.”

I don’t have pricing information for an SAS “base license.”

### The Science and Magic of User and Expert Feedback for Improving Recommendations

Friday, May 27th, 2011

The Science and Magic of User and Expert Feedback for Improving Recommendations by Dr. Xavier Amatriain (Telefonica).

Abstract:

Recommender systems are playing a key role in the next web revolution as a practical alternative to traditional search for information access and filtering. Most of these systems use Collaborative Filtering techniques in which predictions are solely based on the feedback of the user and similar peers. Although this approach is considered relatively effective, it has reached some practical limitations such as the so-called Magic Barrier. Many of these limitations stem from the fact that explicit user feedback in the form of ratings is considered the ground truth. However, this feedback has a non-negligible amount of noise and inconsistencies. Furthermore, in most practical applications, we lack enough explicit feedback and would be better off using implicit feedback or usage data.

In the first part of my talk, I will present our studies in analyzing natural noise in explicit feedback and finding ways to overcome it to improve recommendation accuracy. I will also present our study of user implicit feedback and an approach to relate both kinds of information. In the second part, I will introduce a radically different approach to recommendation that is based on the use of the opinions of experts instead of regular peers. I will show how this approach addresses many of the shortcomings of traditional Collaborative Filtering, generates recommendations that are better perceived by the users, and allows for new applications such as fully-privacy preserving recommendations.
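For readers new to the term, the peer-based prediction the abstract describes can be sketched in a few lines of Python. This is a generic, textbook-style user-based collaborative filter, not Dr. Amatriain's method; the ratings, user names and the choice of cosine similarity are all my assumptions.

```python
from math import sqrt

# Made-up ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the peers' ratings for `item`."""
    peers = [(cosine(ratings[user], r), r[item])
             for name, r in ratings.items()
             if name != user and item in r]
    total = sum(sim for sim, _ in peers)
    if total == 0:
        return None  # no peer has rated the item
    return sum(sim * rating for sim, rating in peers) / total


# ann has not rated "D"; her tastes resemble bob's more than cara's.
print(round(predict("ann", "D"), 2))
```

Every number that goes in is an explicit rating treated as ground truth, which is exactly the assumption the talk questions.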

Chris Anderson: “We are leaving the age of information and entering the age of recommendation.”

I suspect Chris Anderson must not be an active library user. Long before recommender systems, librarians were making recommendations to researchers, patrons and children doing homework. I would say we are returning to the age of librarians, assisted by recommender systems.

Librarians use the reference interview so that based on feedback from patrons they can make the appropriate recommendations.

If you substitute librarian for “expert” in this presentation, it becomes apparent the world of information is coming back around to libraries and librarians.

Librarians should be making the case, both in the literature and to researchers like Dr. Amatriain, that librarians can play a vital role in recommender systems.

This is a very enjoyable as well as useful presentation.

For further information see:

http://xavier.amatriain.net

http://technocalifornia.blogspot.com

### Riak Core: Dynamo Building Blocks

Friday, May 27th, 2011

Riak Core: Dynamo Building Blocks

Highly recommended!

Summary:

Andy Gross discusses the design philosophy behind Riak based on Amazon Dynamo – Gossip Protocol, Consistent Hashing, Vector clocks, Read Repair, etc. -, overviewing its main features and architecture.

Amazon’s Dynamo paper: Dynamo: Amazon’s Highly Available Key-value Store

One of the more intriguing slides represented http/apps/dbs as a stack to show that, while scaling of the http layer is well-known and scaling of apps is more difficult but still doable, scaling of storage is the most expensive and difficult.

I mention that because I suspect scaling of databases has a lot in common with scaling of topic maps.

On the issue of consistency, the point was made that “expires” can be included in HTTP headers, which indicate a fact is good until some time. I wonder, could a topic have a “last merged” property? So that a user can choose the timeliness they need? So that “last merged” 7 days ago is public information, “last merged” 3 days ago is subscriber information and the most recent “last merged” is premium information.

For example, instead of trying to regulate insider trading, the SEC could create a topic map of stocks and sell insider trading information, suitably priced to keep its “insider” character, except that for enough money, anyone could play. The SEC portion of the subscription + selling price could be used to finance other enforcement activities.
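A rough Python sketch of the “last merged” idea, under stated assumptions: the `TopicVersion` class, the `last_merged` property, the tier names and the 7-day and 3-day thresholds are all hypothetical, taken only from the musing above and not from any topic map engine or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical tiers: the minimum age a merge must have before a tier may see it.
TIER_MIN_AGE = {
    "public": timedelta(days=7),
    "subscriber": timedelta(days=3),
    "premium": timedelta(0),
}


@dataclass
class TopicVersion:
    """One merged state of a topic; `last_merged` is the hypothetical property."""
    topic_id: str
    last_merged: datetime
    content: str


def visible_version(versions, tier, now):
    """Return the most recently merged version that is old enough for `tier`."""
    min_age = TIER_MIN_AGE[tier]
    eligible = [v for v in versions if now - v.last_merged >= min_age]
    return max(eligible, key=lambda v: v.last_merged, default=None)


now = datetime(2011, 5, 27, 12, 0)
history = [
    TopicVersion("acme-stock", now - timedelta(days=10), "merged 10 days ago"),
    TopicVersion("acme-stock", now - timedelta(days=4), "merged 4 days ago"),
    TopicVersion("acme-stock", now - timedelta(hours=1), "merged 1 hour ago"),
]

for tier in ("public", "subscriber", "premium"):
    print(tier, "->", visible_version(history, tier, now).content)
```

The design choice here is to keep every merged state and filter at read time, much as an HTTP cache serves whatever copy is still acceptable to the requester.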

This presentation plus the Amazon paper make nice weekend reading/viewing.

### Zanran

Friday, May 27th, 2011

Zanran

A search engine for data and statistics.

I was puzzled by results containing mostly PDF files until I read:

Zanran doesn’t work by spotting wording in the text and looking for images – it’s the other way round. The system examines millions of images and decides for each one whether it’s a graph, chart or table – whether it has numerical content.

Admittedly you may have difficulty re-using such data but finding it is a big first step. You can then contact the source for the data in a more re-usable form.

From Hints & Helps:

Language. English only please… for now.
Phrase search. You can use double quotes to make phrases (e.g. “mobile phones”).
Vocabulary. We have only limited synonyms – please try different words in your query. And we don’t spell-check … yet.

From the website:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Put more simply: Zanran is Google for data.

Well said.

### GAMIFY – SETI Contest

Friday, May 27th, 2011

GAMIFY – SETI Contest

From the webpage:

Are you a gamification expert[1] or interested in becoming one? Want to help solve a problem of epic proportions that could have a major impact on the world?

The SETI Institute and Gamify[2] together have created an EPIC Contest to explore possible ways to gamify SETI. We’re asking the most brilliant Earthlings to come up with ideas on how to apply gamification[3] to increase participation in the SETI program.

The primary goal of this social/scientific challenge is to help SETI empower global citizens to participate in the search for cosmic company and to help SETI become financially sustainable so it can live long and prosper. This article explains our problem and what we are looking to accomplish. We invite everyone to answer the question, “How would you gamify SETI?”.

To be more specific:

• Can we create a fun and compelling app or set of apps that allow people to aid us in identifying signals?
• Do you have any ideas to make this process a fun game, while also solving our problem, by applying game mechanics and game-thinking?
• Can we incorporate sharing and social interaction between players?
• Is monetization possible through virtual goods, “status short-cuts” or other methods popularized by social games?
• Are there any angles of looking at the problem and gamifying that we have not thought of?

The scientific principles involved in this field of science can be very complicated. A conscious attempt has been made to explain the challenge we face with a minimum of scientific explanation or jargon. We wish to be able to clearly explain our unique problems and desired outcomes to the scientific and non-scientific audience.

….

You will see from the presentations at Web 2.0 Expo SF 2011 that gamification is a growing theme in UI development.

I mention this because:

1. Gamification has the potential to ease the authoring and use(?) of topic maps.
2. SETI is a complex and important project and so a good proving ground for gamification.
3. Insights found here may be applicable to more complex data, like texts.

### Web 2.0 Expo SF 2011

Friday, May 27th, 2011

Web 2.0 Expo SF 2011

Presentations and in many cases slides from the Web 2.0 Expo, March 28-31, 2011.

Just scanning the titles of the presentations, I would suggest sending this link to your UI team. There are a number of presentations that will give your UI team ideas for a successful interface.

Having a great topic map engine isn’t enough. Nor is having great content enough. Users have to like using your interface.

If there are any presentations that you find particularly helpful, please mention them in a comment.

### How Graph Databases Can Make You a Superstar

Thursday, May 26th, 2011

How Graph Databases Can Make You a Superstar by Andrés Taylor.

Nothing new if you are already using graph databases.

But there are some highly amusing slides and illustrations of why graph databases are important. It would be more useful to reuse these slides than to reinvent them.

Trees are the lamest of graphs.

As an overlapping markup person, I wanted to dance in the streets!

### Google Correlate & Party Games

Thursday, May 26th, 2011

A new service from Google. From the blog entry:

It all started with the flu. In 2008, we found that the activity of certain search terms are good indicators of actual flu activity. Based on this finding, we launched Google Flu Trends to provide timely estimates of flu activity in 28 countries. Since then, we’ve seen a number of other researchers—including our very own—use search activity data to estimate other real world activities.

However, tools that provide access to search data, such as Google Trends or Google Insights for Search, weren’t designed with this type of research in mind. Those systems allow you to enter a search term and see the trend; but researchers told us they want to enter the trend of some real world activity and see which search terms best match that trend. In other words, they wanted a system that was like Google Trends but in reverse.

This is now possible with Google Correlate, which we’re launching today on Google Labs. Using Correlate, you can upload your own data series and see a list of search terms whose popularity best corresponds with that real world trend. In the example below, we uploaded official flu activity data from the U.S. CDC over the last several years and found that people search for terms like [cold or flu] in a similar pattern to actual flu rates…
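The blog entry does not say how the matching is computed, so the following is only a guess at the general idea: rank candidate search terms by the plain Pearson correlation between each term's series and the uploaded series. The numbers and the `best_matches` helper are made up for illustration and have nothing to do with Google's actual data or implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no trend to match
    return cov / sqrt(var_x * var_y)


def best_matches(target, term_series, top=3):
    """Rank search terms by how closely their series track the target series."""
    scored = [(pearson(target, s), term) for term, s in term_series.items()]
    return sorted(scored, reverse=True)[:top]


# Made-up weekly numbers, purely for illustration.
flu_activity = [1, 2, 4, 8, 4, 2, 1]
searches = {
    "cold or flu": [10, 20, 40, 80, 40, 20, 10],  # same shape, scaled up
    "sunscreen": [8, 4, 2, 1, 2, 4, 8],           # peaks when flu is lowest
    "tax forms": [5, 5, 5, 5, 5, 5, 5],           # flat
}

for score, term in best_matches(flu_activity, searches):
    print(f"{score:+.2f}  {term}")
```

Correlation ignores scale, so a term with ten times the volume still matches perfectly if its shape is the same. That is also why the results can be amusing: shape is all that is being compared.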

One use of Google Correlate would be party games to guess the correlated terms.

I looked at the “rainfall” correlation example.

For “annual rainfall (in) |correlate| disney vacation package,” I would have guessed “prozac” and not “mildew remover.” Shows what I know.

I am sure topic map authors have other uses for these Google tools. What are yours?

### Something Completely Different: A useful deliverable?

Thursday, May 26th, 2011

Breaking with the long-standing, respected and near-holy tradition of conference workshops in which jet-lagged, caffeine-jagged, email-reading passive-aggressives half-listen to speakers who are not reading their email, the Balisage pre-conference workshop will focus on creating a useful deliverable.

For the topic: Document Oriented XML: Identifying Attainable Expectations

Important dates:

Balisage
Symposium August 1, 2011
Conference August 2-5, 2011

From the announcement:

This year after a short introduction to the topic, the goal, and the approach, the attendees will break out into work groups with writing assignments and will actively participate in the development of a white paper. As the day progresses, groups will work on assignments, report back to the whole, and receive new assignments.

The notes, text, lists, and stories created during the workshop will be turned over to an editor who will produce a White Paper from the work produced during the workshop.

We expect this to be an intense, interactive, and productive day.

Participate if you want to:

• help draft a document that will meet the needs of many
• influence the direction and content of this document
• learn what some others think
• work elbow to elbow with XMLers of different backgrounds for a day
• throw yourself into an interactive group activity.

If this does not sound like your sort of day, if you are more comfortable in a more traditional conference environment, please join us for Balisage: The Markup Conference 2011, starting the following day.

Registration Information: http://www.balisage.net/registration.html

Details on the Symposium: http://www.balisage.net/interchange/
Details on Balisage: The Markup Conference: http://www.balisage.net

August, Montreal, Balisage, markup folks, what more could you want?

### TransportDublin.ie

Thursday, May 26th, 2011

TransportDublin.ie

Neo4j powers this trip planner for a magical city.

Question: If this map is shown on an iPhone and can display more information about either the starting or ending location, is that a fragment?

I ask because it would be less than all the information the source contains. Which is one sense of “fragment.”

I don’t know that this representation (yet) can do that, but the delivery of the route made me think about the information being delivered. It is in some very real sense “complete” for purposes of navigating about Dublin. If I ask again, I will get another “complete” information set. And I have no trouble seeing relationships between those two sets of information.

### 24th OpenMath Workshop

Thursday, May 26th, 2011

24th OpenMath Workshop
Bertinoro, Italy
July 20, 2011
co-located with CICM 2011
Continuous submission until July 10

From the post with the announcement (the link at the CICM site is broken, as of 24 May 2011)

OBJECTIVES

With the release of the MathML 3 W3C recommendation, OpenMath enters a new phase of its development. Topics we expect to see at the workshop include

• Feature Requests (Standard Enhancement Proposals) and Discussions for OpenMath3
• Convergence of OpenMath and MathML 3
• Reasoning with OpenMath
• Software using or processing OpenMath
• New OpenMath Content Dictionaries

though others related to OpenMath are certainly welcomed. For examples of contributions see the 22nd OpenMath Workshop of 2009 (http://staff.bath.ac.uk/masjhd/OM2009.html#contributions).

Contributions can be either full research papers, Standard Enhancement Proposals, or a description of new Content Dictionaries, particularly ones that are suggested for formal adoption by the OpenMath Society.

IMPORTANT DATES (all times are GMT)

OpenMath 2011 does not have a submission deadline. Submissions will be accepted until July 10 and reviewed and notified continuously.

SUBMISSIONS

Submission is by e-mail to omws2011@googlegroups.com. Papers must conform to the Springer LNCS style, preferably using LaTeX2e and the Springer llncs class files.

Submission categories:

• Full paper: 4-12 LNCS pages
• Short paper: 1-8 LNCS pages
• CD description: 1-8 LNCS pages; a .zip or .tgz file of the CDs should be attached.
• Standard Enhancement Proposal: 1-12 LNCS pages (as appropriate w.r.t. the background knowledge required); a .zip or .tgz file of any related implementation (e.g. a Relax NG schema) should be attached.

PROCEEDINGS

Electronic proceedings will be published on the OpenMath web site in time for the conference.

WORKSHOP COMMITTEE

• James Davenport (The University of Bath)
• Michael Kohlhase (Jacobs University Bremen, Germany)
• Christoph Lange (Jacobs University Bremen, Germany)

### Near Bare Metal – Acunu

Wednesday, May 25th, 2011

Acunu Storage Platform

From the webpage:

The Acunu Storage Platform is a powerful storage solution that brings simpler, faster and more predictable performance to NOSQL stores like Apache Cassandra.

Our view is that the new data intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Storage Platform is the result of a radical re-think of those assumptions; the result is high performance from low cost commodity hardware.

It includes the Acunu Storage Core which runs in the Linux kernel. On top of this core, we provide a modified version of Apache Cassandra. This is essentially the same as “vanilla” Cassandra but uses the Acunu Storage Core to store data instead of the Linux file system and is therefore able to take advantage of the performance benefits of our platform. In addition to Cassandra, there is also an object store similar to Amazon’s S3; we have a number of other more experimental projects in the pipeline which we’ll talk about in future posts.

Perhaps the start of something very interesting.

It took NoSQL a couple of years to flower into the range of current offerings.

I wonder if working in the kernel will follow a similar path?

Will we see a graph engine as part of the kernel?

### Hadoop Dont’s: What not to do to harvest Hadoop’s full potential

Wednesday, May 25th, 2011

Hadoop Dont’s: What not to do to harvest Hadoop’s full potential by Iwona Bialynicka-Birula.

From the post:

We’ve all heard this story. All was fine until one day your boss heard somewhere that Hadoop and No-SQL are the new black and mandated that the whole company switch over whatever it was doing to the Hadoop et al. technology stack, because that’s the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.

So you threw away your old relational database back end and maybe all or part of your middle tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But as you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, but it’s not even scalable or reliable and your meetings are starting more and more to resemble the Hadoop Downfall parody.

An excellent post on problems to avoid with Hadoop!

### GraphStream 1.0 Release

Wednesday, May 25th, 2011

GraphStream 1.0 Release

From the website:

With GraphStream you deal with graphs. Static and Dynamic.
You create them from scratch, from a file or any source.
You display and render them.

From Getting Started:

GraphStream is a graph handling Java library that focuses on the dynamics aspects of graphs. Its main focus is on the modeling of dynamic interaction networks of various sizes.

The goal of the library is to provide a way to represent graphs and work on it. To this end, GraphStream proposes several graph classes that allow to model directed and undirected graphs, 1-graphs or p-graphs (a.k.a. multigraphs, that are graphs that can have several edges between two nodes).

GraphStream allows to store any kind of data attribute on the graph elements: numbers, strings, or any object.

Moreover, in addition, GraphStream provides a way to handle the graph evolution in time. This means handling the way nodes and edges are added and removed, and the way data attributes may appear, disappear and evolve.

You can also get an idea of the range of capabilities from the GraphStream 1.0 video.
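To make the "graph evolution in time" idea from Getting Started concrete, here is a minimal sketch in Python. It is not GraphStream's Java API: the `DynamicGraph` class, the event names and the `replay` helper are all hypothetical, invented for illustration. It assumes a dynamic graph can be treated as a stream of events (nodes, edges and attributes appearing and disappearing) replayed in order. Edges are keyed by their own id, so several edges between the same two nodes are possible, as in a multigraph.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicGraph:
    """Toy dynamic graph (hypothetical, not the GraphStream API)."""
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: dict = field(default_factory=dict)  # edge id -> (source, target)

    def apply(self, event):
        """Apply one event tuple of the form (kind, *args) to the graph."""
        kind, *args = event
        if kind == "add_node":
            self.nodes.setdefault(args[0], {})
        elif kind == "remove_node":
            node = args[0]
            self.nodes.pop(node, None)
            # Removing a node also removes every edge attached to it.
            self.edges = {e: (s, t) for e, (s, t) in self.edges.items()
                          if node not in (s, t)}
        elif kind == "add_edge":
            edge, source, target = args
            if source not in self.nodes or target not in self.nodes:
                raise KeyError("both endpoints must exist before adding an edge")
            self.edges[edge] = (source, target)
        elif kind == "remove_edge":
            self.edges.pop(args[0], None)
        elif kind == "set_attr":
            node, key, value = args
            self.nodes[node][key] = value
        elif kind == "del_attr":
            node, key = args
            self.nodes[node].pop(key, None)
        else:
            raise ValueError(f"unknown event: {kind}")


def replay(events):
    """Build a graph by applying events in the order they occurred."""
    graph = DynamicGraph()
    for event in events:
        graph.apply(event)
    return graph


events = [
    ("add_node", "A"),
    ("add_node", "B"),
    ("add_edge", "AB", "A", "B"),
    ("set_attr", "A", "label", "start"),
    ("del_attr", "A", "label"),
    ("remove_node", "B"),
]
final = replay(events)
```

Replaying a prefix of the event list gives the state of the graph at that point in time, which is one simple way to think about a graph that changes.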

### GraphStream 1.0 Video

Wednesday, May 25th, 2011

GraphStream 1.0 Video

I could roll this into a post about the GraphStream 1.0 release but this is a serious piece of work on its own.

The following connections demonstration should be of interest to the intelligence communities around the world.

High quality intelligence is no longer the sole province of those who can afford one-off computer installations.

### Open government sites scrapped due to budget cuts

Wednesday, May 25th, 2011

Open government sites scrapped due to budget cuts

This isn’t so much surprising as it is disappointing. We now know the priority that “open” government has in U.S. government budgetary discussions.

I could go on at length about this decision, the people who made it, complete with speculation on their motives, morals and parentage. Unfortunately, that would not restore the funding nor would it be a useful exercise.

As an alternative, let me suggest that everyone select one or two of the data sets that are already available and do something interesting. Something that will catch the imagination of the average citizen. Then credit these government sites as the sources and gently point out that with more funding, there would be more data. And hence more interesting things to see.

Asking someone at the agencies that produce data could result in interesting suggestions. They may lack the time, resources, or personnel to do something really creative but with their ideas and your talents…, well, the result could interest the agency and the public. These agencies are the ones fighting on the inside of the public budget process for funding.

What data sets and ideas for those data sets do you think would have the most appeal or impact?

### The Architecture of Open Source Applications

Wednesday, May 25th, 2011

The Architecture of Open Source Applications by Amy Brown and Greg Wilson (eds).

From the website:

Architects look at thousands of buildings during their training, and study critiques of those buildings written by masters. In contrast, most software developers only ever get to know a handful of large programs well—usually programs they wrote themselves—and never study the great programs of history. As a result, they repeat one another’s mistakes rather than building on one another’s successes.

This book’s goal is to change that. In it, the authors of twenty-five open source applications explain how their software is structured, and why. What are each program’s major components? How do they interact? And what did their builders learn during their development? In answering these questions, the contributors to this book provide unique insights into how they think.

If you are a junior developer, and want to learn how your more experienced colleagues think, this book is the place to start. If you are an intermediate or senior developer, and want to see how your peers have solved hard design problems, this book can help you too.

I thought this might be of interest to the developer side of the topic map house.

One can imagine a similar volume for topic maps as well.

### Persistent Identifiers?

Tuesday, May 24th, 2011

Lutz Maicher tweeted about Identifier Persistence: Fundamentals yesterday.

It claims two foundations for identifier persistence:

1. Identifier persistence requires an organizational commitment. Persistence cannot be ensured by a few renegades in the skunk-works, nor can it be mandated from on high without the support of those who manage the identifiers or produce web resources. All individuals involved in the life-cycle of web resources must be committed to persistence in perpetuity if true persistence of identifiers is to be achieved.
2. No technology, no standard, no identifier scheme, no information architecture will get you persistence. Whether you choose native URIs, Handles, DOIs, PURLs, ARKs, UUIDs, or XRIs, you will never achieve identifier persistence without active management of your identifiers and web resources. This requires the aforementioned organizational commitment since such management cannot occur without sufficient resources. Management of web resources and identifiers requires time and due diligence and those don’t come for free.

(emphasis in original)

So, identifier persistence requires active management of identifiers and web resources?

But when I think of persistent identifiers, I have something more like:

[Image: an identifier for Cleopatra]

in mind. (Trevor Lowe)

It has been, what?, over 2,000 years without active management of identifiers and web resources and it still persists as an identifier.

And that is a fairly recent identifier in the great scheme of identifiers. There are those that are far older.

I don’t deny the convenience or utility of web identifiers. But in terms of persistence, where should we look for a digital Rosetta stone when the maintenance of opaque identifiers and 303 redirects has fallen into disuse? I have heard it mentioned that fifteen or twenty years is persistence for a web identifier. Perhaps so, but realize that the persistence of the identifier for Cleopatra that appears above is more than two orders of magnitude greater.

How would your business be different today if there were a cone of information darkness only fifteen or twenty years (optimistic estimate) behind you? And with each passing year, another year drops into a digital abyss. Some things persist, others don’t. Usually the ones you want/need don’t. Or so it always seems.

My suggestion isn’t yet-another-persistence-proposal (YAPP), the kind that involves multi-century funding/staffing and proposals to bind future generations to our present notions of persistence syntax.

Let’s write web identifiers using (in part) identifiers that are already meaningful in our professions, occupations and hobbies. Identifiers that are not dependent on particular resolution mechanisms or technologies. Identifiers that will persist long after their maintenance has failed. That is a step towards persistence.

### Cassa

Tuesday, May 24th, 2011

Cassa

From the webpage:

A SPARQL 1.1 Graph Store HTTP Protocol [1] implementation for RDF and Topic Maps.

The somewhat longer announcement on topicmapmail, SPARQL 1.1 Graph Store HTTP Protocol for Topic Maps:

Last week discovered the SPARQL 1.1 Graph Store HTTP Protocol [1] and I wondered if this wouldn’t be a good alternative to SDShare [2].

The graph store protocol uses no artificial technologies like Atom but uses REST and RDF consequently. The service uses an ontology [3] to inform the client about available graphs etc.

The protocol allows creation of graphs, deletion of graphs and updating graphs and discovery of graphs (through the service description).

The protocol is rather generic, so it’s usable for Topic Maps as well (graph == topic map).

The protocol provides no fragments/snapshots like SDShare, though. Adding these functionality to the protocol would be interesting, I’d think. I.e. each graph update would trigger a new fragment. Maybe this functionality would also solve the “push problem” [4] without inventing yet another syntax. The description of the available fragments should also be done with an ontology and not solely with Atom, though.

Anyway, I wanted to mention it as a good, *dogfooding* protocol which could be used for Topic Maps.

I created an implementation (Cassa) of the protocol at [5] (no release yet). The implementation supports Topic Maps and RDF but it doesn’t provide the service description yet. And I didn’t translate the service description ontology to Topic Maps yet.
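For readers who have not seen the protocol, the graph operations mentioned in the announcement (create, update, delete, retrieve) map onto plain HTTP methods. The sketch below shows that mapping as I understand it from the W3C specification; the 2011 draft under discussion may have differed in detail. It has nothing to do with Cassa's code: the `graph_store_request` helper, the endpoint URL and the graph IRI are all made up for illustration, and no request is actually sent.

```python
from urllib.parse import quote

# HTTP methods the Graph Store Protocol assigns to each graph operation.
METHODS = {
    "retrieve": "GET",
    "replace": "PUT",    # creates the graph if absent, otherwise replaces it
    "merge": "POST",     # adds the payload to an existing graph
    "delete": "DELETE",
}


def graph_store_request(service_url, operation, graph_iri=None):
    """Return (method, url) for an operation on one graph.

    graph_iri=None addresses the default graph; otherwise the graph is
    named indirectly through a percent-encoded ?graph= parameter.
    """
    method = METHODS[operation]
    if graph_iri is None:
        query = "default"
    else:
        query = "graph=" + quote(graph_iri, safe="")
    return method, f"{service_url}?{query}"


# Hypothetical endpoint and graph IRI, for illustration only.
request = graph_store_request(
    "http://example.org/rdf-graph-store",
    "replace",
    "http://example.org/maps/opera",
)
```

If graph == topic map, as the announcement suggests, the same four requests would cover storing, merging, fetching and dropping a topic map, with only the payload format changing.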