Archive for January, 2013

Resources for ‘Data Visualisation for Analysis in Scholarly Research’

Thursday, January 31st, 2013

Resources for ‘Data Visualisation for Analysis in Scholarly Research’ by Mia Ridge.

From the post:

A collection of links for further reading for the British Library’s Digital Scholarship course on ‘Data Visualisation for Analysis in Scholarly Research’.

If you are engaged in scholarly research, these are excellent sources on data visualization.

If you are pitching topic maps to academic researchers, the sources will help you to better understand your user community and their expectations from visualization.

Open Data Protocol

Thursday, January 31st, 2013

Open Data Protocol

From the webpage:

There is a vast amount of data available today and data is now being collected and stored at a rate never seen before. Much, if not most, of this data however is locked into specific applications or formats and difficult to access or to integrate into new uses.

The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData does this by applying and building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores. The protocol emerged from experiences implementing AtomPub clients and servers in a variety of products over the past several years. OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites.

OData is consistent with the way the Web works – it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools.

I have mentioned this resource before but it was buried in a post and not a separate post.

The documentation has grown and improved considerably since then.
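To make the "URIs plus a uniform HTTP interface" point concrete, here is a minimal sketch that builds an OData query URL from system query options such as $filter and $top. The service root, entity set and property names are hypothetical; only the `$`-prefixed option names and the `lt` operator come from OData itself.

```python
from urllib.parse import urlencode, quote

def odata_query_url(service_root, entity_set, **options):
    """Build an OData query URL from system query options.

    Option names are passed without the leading '$',
    e.g. filter="Price lt 20", top=5.
    """
    base = f"{service_root.rstrip('/')}/{entity_set}"
    if not options:
        return base
    # OData system query options carry a '$' prefix; keep it unescaped.
    params = {f"${name}": value for name, value in options.items()}
    return base + "?" + urlencode(params, quote_via=quote, safe="$")

# Hypothetical service and entity set, for illustration only.
url = odata_query_url("http://example.org/odata", "Products",
                      filter="Price lt 20", top=5)
print(url)
```

The resulting URL can be fetched with any HTTP client, which is the whole appeal: no application-specific API is needed to ask the question.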


Demining the “Join Bomb” with graph queries

Thursday, January 31st, 2013

Demining the “Join Bomb” with graph queries by Rik Van Bruggen.

From the post:

For the past couple of months, and even more so since the beer post, people have been asking me a question that I have been struggling to answer myself for quite some time: what is so nice about the graphs? What can you do with a graph database that you could not, or only at great pains, do in a traditional relational database system. Conceptually, everyone understands that this is because of the inherent query power in a graph traversal – but how to make this tangible? How to show this to people in a real and straightforward way?

And then Facebook Graph Search came along, along with its many crazy search examples – and it sort of hit me: we need to illustrate this with *queries*. Queries that you would not – or only with a substantial amount of effort – be able to do in traditional database system – and that are trivial in a graph.

This is what I will be trying to do in this blog post, using an imaginary dataset that was inspired by the Telecommunications industry. You can download the dataset here, but really it is very simple: a number of “general” data elements (countries, languages, cities), a number of “customer” data elements (person, company) and a number of more telecom-related data elements (operators – I actually have the full list of all mobile operators in the countries in the dataset coming from here and here, phones and conference call service providers).

Great demonstration using simulated telecommunications data of the power of graph queries.

Highly recommended!
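A small sketch of why traversal queries feel natural on a graph: "who can be reached within N hops" is a single breadth-first walk, whereas a relational formulation typically needs one self-join per hop. The call graph below is made up and is not the dataset from the post.

```python
from collections import deque

# Hypothetical call graph: person -> people they have called.
CALLS = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

def reachable_within(graph, start, max_hops):
    """Breadth-first traversal returning {person: hops} for everyone
    reachable from `start` in at most `max_hops` steps."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, []):
            if neighbour not in hops:
                hops[neighbour] = hops[person] + 1
                queue.append(neighbour)
    del hops[start]  # the start node is not part of the answer
    return hops

print(reachable_within(CALLS, "alice", 2))
```

Raising `max_hops` changes one argument here; in SQL it would usually mean adding another join.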

Seeking Creative Use Cases for Thomson Reuters Web of Knowledge

Thursday, January 31st, 2013

Seeking Creative Use Cases for Thomson Reuters Web of Knowledge

From the post:

AWARD: $10,000 USD | DEADLINE: 2/24/13 | ACTIVE SOLVERS: 467 | POSTED: 1/18/13

This Challenge seeks use cases for Thomson Reuters Web of Knowledge content, tools, and APIs (Application Programming Interface) that would enable users to engage in creative new behaviors, beyond what is currently possible with online research portals. How will users want to search and discover scholarly content throughout the next 5 years?

This Challenge is an Ideation Challenge, with a guaranteed award for at least one submitted solution. In this first phase, the Seeker is looking for creative ideas/use cases; no programming or code delivery is required.

See the post for details and links.

Only the best idea is required. No eye candy to cover up a poor idea.

Welcome to the Unified Astronomy Thesaurus!

Thursday, January 31st, 2013

Welcome to the Unified Astronomy Thesaurus!

From the webpage:

The Unified Astronomy Thesaurus (UAT) will be an open, interoperable and community-supported thesaurus which unifies the existing divergent and isolated Astronomy & Astrophysics thesauri into a single high-quality, freely-available open thesaurus formalizing astronomical concepts and their inter-relationships. The UAT builds upon the existing IAU Thesaurus with major contributions from the Astronomy portions of the thesauri developed by the Institute of Physics Publishing and the American Institute of Physics. We expect that the Unified Astronomy Thesaurus will be further enhanced and updated through a collaborative effort involving broad community participation.

While the AAS has assumed formal ownership of the UAT, the work will be available under a Creative Commons License, ensuring its widest use while protecting the intellectual property of the contributors. We envision that development and maintenance will be stewarded by a broad group of parties having a direct stake in it. This includes professional associations (IVOA, IAU), learned societies (AAS, RAS), publishers (IOP, AIP), librarians and other curators working for major astronomy institutes and data archives.

The main impetus behind the creation of a single thesaurus has been the wish to support semantic enrichment of the literature, but we expect that use of the UAT (along with other vocabularies and ontologies currently being developed in our community) will be much broader and will have a positive impact on the discovery of a wide range of astronomy resources, including data products and services.

Several thesauri are listed as resources at this site.

Certainly would make an interesting topic map project.

I first saw this at: Science Reference: A New Thesaurus Created for the Astronomy Community by Gary Price.

Open Data Protocols, DCIP [A Topic Map Song]

Thursday, January 31st, 2013

Open Data Protocols, DCIP

From the post:

Have you ever heard about Data Catalog Interoperability Protocol (DCIP)? DCIP is a specification designed to facilitate interoperability between data catalogs published on the Web by defining:

  • a JSON and RDF representation for key data catalog entities such as Dataset (DatasetRecord) and Resource (Distribution) based on the DCAT vocabulary
  • a read only REST based protocol for achieving basic catalog interoperability

Data Catalog Interoperability Protocol (DCIP) v0.2 discusses each of the above and provides examples. The approach is designed to be pragmatic and easily implementable. It merges existing work on DCAT with the real-life experiences of “harvesting” in various projects.

To know more about DCIP, you can visit the Open Data Protocols website, which aims to make it easier to develop tools and services for working with data, and to ensure greater interoperability between new and existing tools and services.

News of new formats, protocols and the like is music to topic map ears.

The refrain is: “cha-ching, cha-ching, cha-ching!”


Only partially in jest.

Every time a new format (read set of subjects) is developed for the encoding of data (another set of subjects), it is by definition different from all that came before.

With good reason. Every sentient being on the planet will be flocking to format/protocol X for all their data.

Well, except that flocking is more like a trickle for most new formats. Particularly when compared to the historical record of formats.

In theory topic maps are an exception to that rule, except that when you map specific instances of other data formats, you have committed yourself to a particular set of mappings.

Still, that’s better than rip-and-replace or ETL processing of data. It maintains backwards compatibility with existing systems while anticipating future systems.

Facebook Graph Search with Cypher and Neo4j

Thursday, January 31st, 2013

Facebook Graph Search with Cypher and Neo4j by Max De Marzi.

From the post:

Facebook Graph Search has given the Graph Database community a simpler way to explain what it is we do and why it matters. I wanted to drive the point home by building a proof of concept of how you could do this with Neo4j. However, I don’t have six months or much experience with NLP (natural language processing). What I do have is Cypher. Cypher is Neo4j’s graph language and it makes it easy to express what we are looking for in the graph. I needed a way to take “natural language” and create Cypher from it. This was going to be a problem.

If you think about “likes” as an association type with role players….

Of course, “like” paints with a broad brush but it is a place to start.
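As a very rough sketch of the "natural language in, Cypher out" idea, the snippet below recognises a single hard-coded sentence shape and returns a parameterised Cypher query. It is far cruder than anything a real natural-language front end would need and is not the approach taken in the post. The `Person` label and the `FRIENDS` and `LIKES` relationship types are assumptions, and the query uses current Cypher syntax rather than the 2013 syntax.

```python
import re

# One hard-coded sentence shape; a real system would need a proper grammar.
PATTERN = re.compile(r"^friends who like (?P<thing>.+)$", re.IGNORECASE)

# "likes" is treated as a typed relationship between two role players:
# the friend (who likes) and the thing (that is liked).
CYPHER = (
    "MATCH (me:Person {name: $me})-[:FRIENDS]-(friend:Person)"
    "-[:LIKES]->(thing {name: $thing}) "
    "RETURN DISTINCT friend.name"
)

def to_cypher(sentence, me):
    """Return (query, parameters) for a recognised sentence, else None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return CYPHER, {"me": me, "thing": match.group("thing").strip()}

print(to_cypher("Friends who like Neo4j", "Max"))
```

Passing the names as parameters rather than splicing them into the query string keeps the generated Cypher fixed per sentence shape.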

Saturday 23rd February is Open Data Day 2013!

Thursday, January 31st, 2013

Saturday 23rd February is Open Data Day 2013! from AIMS.

From the post:

Open Data Day is a gathering of citizens in cities around the world to write applications, liberate data, create visualizations and publish analyses using open public data to show support for and encourage the adoption of open data policies by the world’s local, regional and national governments. There are Open Data Day events taking place all around the world.

Are you planning to organize or participate in one of these events? Are you going to launch new open data catalogs on the Open Data Day? Share with us your plans and highlight events that might be of interest for the agricultural information management community.

Know more at

As of today: 52 events.

Anyone interested in a virtual event on Open Data Day using open data and topic maps?

5 online tools in data visualization playground

Thursday, January 31st, 2013

5 online tools in data visualization playground

From the post:

While building up an analytics dashboard, one of the major decision points is regarding the type of charts and graphs that would provide better insight into the data. To avoid a lot of re-work later, it makes sense to try the various chart options during the requirement and design phase. It is probably a well known myth that existing tool options in any product can serve all the user requirements with just minor configuration changes. We all know and realize that code needs to be written to serve each customer’s individual needs.

To that effect, here are 5 tools that could empower your technical and business teams to decide on visualization options during the requirement phase. Listed below are online tools for you to add data and use as playground.


  1. Many Eyes (IBM)
  2. Circos
  3. Google Chart Tools
  4. Color Brewer
  5. Mr. Data Converter

Plus other related tools.

Google Guide [Improve Google Searching]

Thursday, January 31st, 2013

Google Guide by Nancy Blachman.

Unofficial documentation for Google searching, but very nice unofficial documentation.

If you want to improve your Google searching, this is a good place to start!

Available in English, Dutch, German, Hebrew and Italian.

SQL, NoSQL =? CoSQL? Category Theory to the Rescue

Wednesday, January 30th, 2013

A co-Relational Model of Data for Large Shared Data Banks by Erik Meijer and Gavin Bierman.

I missed this when it appeared in March of 2011.

From the conclusion:

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor’s product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQL—namely, key-value stores as the mathematical dual of SQL’s foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

The ACM Digital Library reports only 3 citations, which is unfortunate for such an interesting proposal.

I have heard about key/value pairs somewhere else. I will have to think about that and get back to you. (Hint for the uninitiated, try the Topic Maps Reference Model (TMRM). A new draft of the TMRM is due to appear in a week or so.)
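A toy illustration of the duality described in the conclusion, under made-up data: in the relational layout the child rows point at their parent through a foreign key, while in the key-value layout the parent holds keys that point at its children. The same comprehension-style query works over both. Python list comprehensions stand in here for the monad comprehensions (LINQ) the authors discuss.

```python
# Relational style: each child row points at its parent via a foreign key.
products = [{"id": 1, "title": "Graph Databases"}]
ratings = [
    {"id": 10, "product_id": 1, "stars": 5},
    {"id": 11, "product_id": 1, "stars": 3},
]

# Key-value style: the parent holds keys that point at its children.
store = {
    "p1": {"title": "Graph Databases", "ratings": ["r10", "r11"]},
    "r10": {"stars": 5},
    "r11": {"stars": 3},
}

def stars_relational(title):
    """Join on the foreign key: the arrow runs from child to parent."""
    return [r["stars"]
            for p in products if p["title"] == title
            for r in ratings if r["product_id"] == p["id"]]

def stars_key_value(title):
    """Follow keys from parent to child: the arrow is reversed."""
    return [store[key]["stars"]
            for value in store.values() if value.get("title") == title
            for key in value["ratings"]]

print(stars_relational("Graph Databases"), stars_key_value("Graph Databases"))
```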

A Data Driven Approach to Query Expansion in Question Answering

Wednesday, January 30th, 2013

A Data Driven Approach to Query Expansion in Question Answering by Leon Derczynski, Jun Wang, Robert Gaizauskas, and Mark A. Greenwood.


Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions.

In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method.

Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

Work on query expansion in natural language answering systems. Closely related to synonymy.

Query expansion tools could be useful prompts for topic map authors seeking terms for mapping.
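A minimal sketch of query expansion, assuming a lookup table of extension words is already available. The paper derives its words from TREC answer texts; the table below is invented for illustration.

```python
# Hypothetical extension words keyed by query term.
EXTENSIONS = {
    "invented": ["inventor", "patent"],
    "telephone": ["bell", "phone"],
}

def expand_query(query, extension_words, max_extra=3):
    """Return the lower-cased query terms followed by up to `max_extra`
    extension words that are not already in the query."""
    terms = query.lower().split()
    extras = []
    for term in terms:
        for word in extension_words.get(term, []):
            if word not in terms and word not in extras:
                extras.append(word)
    return terms + extras[:max_extra]

print(expand_query("Who invented the telephone", EXTENSIONS))
```

For a topic map author, the `extras` list is the interesting part: candidate terms that may name the same subject as the original query terms.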

Collaborative Filtering via Group-Structured Dictionary Learning

Wednesday, January 30th, 2013

Collaborative Filtering via Group-Structured Dictionary Learning by Zoltan Szabo, Barnabas Poczos, and Andras Lorincz.


Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented method outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements.

From the paper:

Novel advances on CF show that dictionary learning based approaches can be efficient for making predictions about users’ preferences [2]. The dictionary learning based approach assumes that (i) there is a latent, unstructured feature space (hidden representation/code) behind the users’ ratings, and (ii) a rating of an item is equal to the product of the item and the user’s feature.
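Assumption (ii) in the passage above can be shown with a short worked example: a predicted rating is the inner product of an item's representation and a user's latent feature vector. The numbers are invented, and nothing here reflects the structured dictionary learning the paper actually contributes.

```python
def predict_rating(item_factors, user_code):
    """Predicted rating = inner product of item and user representations."""
    if len(item_factors) != len(user_code):
        raise ValueError("item and user vectors must have the same length")
    return sum(i * u for i, u in zip(item_factors, user_code))

# Hypothetical three-dimensional latent space.
item = [1.0, 0.5, 0.0]   # how strongly the item expresses each latent feature
user = [4.0, 2.0, 3.0]   # how much the user values each latent feature
print(predict_rating(item, user))  # 1.0*4.0 + 0.5*2.0 + 0.0*3.0 = 5.0
```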

Is a “preference” actually a form of subject identification?

I ask because the notion of a “real time” system is incompatible with users researching the proper canonical subject identifier and/or waiting for a response from an inter-departmental committee to agree on correct terminology.

Perhaps subject identification in some systems must be on the basis of “…latent, unstructured feature space[s]…” that are known (and disclosed) imperfectly at best.

Zoltán Szabó’s Home Page, numerous publications and the source code for this article.

Connecting to Neo4j using Spring Data

Wednesday, January 30th, 2013

Connecting to Neo4j using Spring Data by Anirvan Chakraborty.


Anirvan’s Progressive Java track will show how to use Spring Data Neo4j to build a Spring-based web application based on the graph database Neo4j. The session will begin with a short introduction into Spring Data Neo4j and follow that up by building a ‘User Management System’ using Neo4j. Anirvan will show just how easy it is to use Spring Data Neo4j to map entity classes to the Neo4j DB. The tutorial expects the attendees to have no previous knowledge of Neo4j or Spring Data Neo4j, but some experience of building web applications using Spring Framework would be helpful.

A bit dated now, but possibly still a useful introduction to Neo4j and Spring Data.

Identity In A Context

Wednesday, January 30th, 2013

Jasmine Ashton frames a quote about Julie Lynch, an archivist, saying:

Due to the nature of her work, Lynch is the human equivalent of a search engine. However, she differs in one key aspect:

“Unlike Google, Lynch delivers more than search results, she provides context. That sepia-tinged photograph of the woman in funny-looking clothes on a funny-looking bicycle actually offers a window into the impact bicycles had on women’s independence. An advertisement touting “can build frame houses” demonstrates construction restrictions following the Great Chicago Fire. Surprisingly, high school yearbooks — the collection features past editions from Lane Tech, Amundsen and Lake View High Schools — serve as more than a cautionary tale in the evolution of hairstyles.”

Despite the increase in technology that makes searching information as easy as tapping a touch screen, this article reiterates the importance of having real people to contextualize these documents. (How Librarians Play an Integral Role When Searching for Historical Documents)

Rather than say “contextualize,” I would prefer to say that librarians provide alternative “contexts” for historical documents.

Recognition of a document, or any other subject, takes place in a context. A librarian can offer the user different contexts in which to understand a document.

Doesn’t invalidate the initial context of understanding, simply becomes an alternative one.

Quite different from our search engines, which see only “matches” and no context for those matches.

Logic and Lattices for Distributed Programming

Wednesday, January 30th, 2013

Logic and Lattices for Distributed Programming

From the post:

Neil Conway from Berkeley CS is giving an advanced level talk at a meetup today in San Francisco on a new paper: Logic and Lattices for Distributed Programming – extending set logic to support CRDT-style lattices.

The description of the meetup is probably the clearest introduction to the paper:

Developers are increasingly choosing datastores that sacrifice strong consistency guarantees in exchange for improved performance and availability. Unfortunately, writing reliable distributed programs without the benefit of strong consistency can be very challenging.


In this talk, I’ll discuss work from our group at UC Berkeley that aims to make it easier to write distributed programs without relying on strong consistency. Bloom is a declarative programming language for distributed computing, while CALM is an analysis technique that identifies programs that are guaranteed to be eventually consistent. I’ll then discuss our recent work on extending CALM to support a broader range of programs, drawing upon ideas from CRDTs (A Commutative Replicated Data Type).

If you have an eye towards understanding the future then this is for you.

Do note that the Bloom language is treated more extensively in Datalog Reloaded. You may recall that the basis for tolog (a topic map query language) was Datalog.
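For readers new to CRDT-style lattices, a small sketch of the idea in Python (not Bloom): replica states are merged with a join that is commutative, associative and idempotent, so replicas converge regardless of the order or number of merges. The grow-only set and grow-only counter below are standard textbook examples, not code from the paper.

```python
def merge_gset(a, b):
    """Grow-only set: the lattice join is set union."""
    return a | b

def merge_gcounter(a, b):
    """Grow-only counter: per-replica maximum of the two states."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def counter_value(state):
    """The counter's value is the sum of all per-replica counts."""
    return sum(state.values())

# Two replicas that have seen different increments.
x = {"r1": 2, "r2": 0}
y = {"r1": 1, "r2": 3}
merged = merge_gcounter(x, y)
print(merged, counter_value(merged))
```

Because the merge only ever moves "up" the lattice, delivering the same update twice or in a different order cannot change the final state.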

Graph Based Classification Methods Using Inaccurate External Classifier Information

Wednesday, January 30th, 2013

Graph Based Classification Methods Using Inaccurate External Classifier Information by Sundararajan Sellamanickam and Sathiya Keerthi Selvaraj.


In this paper we consider the problem of collectively classifying entities where relational information is available across the entities. In practice inaccurate class distribution for each entity is often available from another (external) classifier. For example this distribution could come from a classifier built using content features or a simple dictionary. Given the relational and inaccurate external classifier information, we consider two graph based settings in which the problem of collective classification can be solved. In the first setting the class distribution is used to fix labels to a subset of nodes and the labels for the remaining nodes are obtained like in a transductive setting. In the other setting the class distributions of all nodes are used to define the fitting function part of a graph regularized objective function. We define a generalized objective function that handles both the settings. Methods like harmonic Gaussian field and local-global consistency (LGC) reported in the literature can be seen as special cases. We extend the LGC and weighted vote relational neighbor classification (WvRN) methods to support usage of external classifier information. We also propose an efficient least squares regularization (LSR) based method and relate it to information regularization methods. All the methods are evaluated on several benchmark and real world datasets. Considering together speed, robustness and accuracy, experimental results indicate that the LSR and WvRN-extension methods perform better than other methods.

Doesn’t read like a page-turner does it? 😉

An example from the paper will help illustrate why this is an important paper:

In this paper we consider a related relational learning problem where, instead of a subset of labeled nodes, we have inaccurate external label/class distribution information for each node. This problem arises in many web applications. Consider, for example, the problem of identifying pages about Public works, Court, Health, Community development, Library etc. within the web site of a particular city. The link and directory relations contain useful signals for solving such a classification problem. Note that this relational structure will be different for different city web sites. If we are only interested in a small number of cities then we can afford to label a number of pages in each site and then apply transductive learning using the labeled nodes. But, if we want to do the classification on hundreds of thousands of city sites, labeling on all sites is expensive and we need to take a different approach. One possibility is to use a selected set of content dictionary features together with the labeling of a small random sample of pages from a number of sites to learn an inaccurate probabilistic classifier, e.g., logistic regression. Now, for any one city web site, the output of this initial classifier can be used to generate class distributions for the pages in the site, which can then be used together with the relational information in the site to get accurate classification.

In topic map parlance, we would say identity was being established by the associations in which a topic participates but that is a matter of terminology and not substantive difference.
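A simplified sketch in the spirit of weighted-vote relational neighbour classification with an external classifier, not the paper's algorithm: each node's class distribution is repeatedly blended from its (possibly inaccurate) external distribution and the weighted average of its neighbours' current distributions. The page names, link weights, `alpha` and the iteration count are all assumptions.

```python
def relational_smooth(edges, prior, alpha=0.5, iterations=20):
    """Blend each node's external class distribution with the weighted
    average of its neighbours' current distributions.

    edges: {node: {neighbour: weight}}
    prior: {node: [p_class0, p_class1, ...]} from the external classifier
    alpha: weight given to the external classifier's distribution
    """
    current = {n: list(p) for n, p in prior.items()}
    for _ in range(iterations):
        updated = {}
        for node, p in prior.items():
            nbrs = edges.get(node, {})
            total = sum(nbrs.values())
            if total == 0:
                updated[node] = list(p)  # isolated node keeps its prior
                continue
            vote = [sum(w * current[m][k] for m, w in nbrs.items()) / total
                    for k in range(len(p))]
            updated[node] = [alpha * pk + (1 - alpha) * vk
                             for pk, vk in zip(p, vote)]
        current = updated
    return current

# Hypothetical city site: page "c" links to "a" and "b".
# Class 0 = "Library", class 1 = "Court".
EDGES = {"a": {"c": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0, "b": 1.0}}
PRIOR = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.45, 0.55]}
result = relational_smooth(EDGES, PRIOR)
print(result["c"])
```

In this toy case the external classifier narrowly puts page "c" in class 1, but its links to two confident class 0 pages pull it over to class 0, which is the "identity established by associations" point in miniature.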

Bad News From UK: … brows up, breasts down

Tuesday, January 29th, 2013

UK plastic surgery statistics 2012: brows up, breasts down by Ami Sedghi.

From the post:

Despite a recession and the government launching a review into cosmetic surgery following the breast implant scandal, plastic surgery procedures in the UK were up last year.

A total of 43,172 surgical procedures were carried out in 2012 according to the British Association of Aesthetic Plastic Surgeons (BAAPS), an increase of 0.2% on the previous year. Although there wasn’t a big change for overall procedures, anti-ageing treatments such as eyelid surgery and face lifts saw double digit increases.

Breast augmentation (otherwise known as ‘boob jobs’) were still the most popular procedure overall although the numbers dropped by 1.6% from 2011 to 2012. Last year’s stats took no account of the breast implant scandal so this is the first release of figures from BAAPS to suggest what impact the scandal has had on the popular procedure.

Just for comparison purposes:

Country   Procedures   Population    Percent of Population Treated
UK            43,172    62,641,000   0.069%
US         9,200,000   313,914,000   2.931%
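The percentage column can be recomputed from the procedure and population counts in the table. Note that a ratio such as 0.00069 is a fraction, so it has to be multiplied by 100 to be read as a percentage. The figures count procedures rather than distinct patients, so "percent treated" is an approximation.

```python
def percent_of_population(procedures, population):
    """Procedures as a percentage of population
    (procedures, not distinct patients)."""
    return 100.0 * procedures / population

uk = percent_of_population(43_172, 62_641_000)
us = percent_of_population(9_200_000, 313_914_000)
print(f"UK: {uk:.3f}%  US: {us:.3f}%")  # UK: 0.069%  US: 2.931%
```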

Perhaps beauty isn’t one of the claimed advantages of socialized medicine?

What is Climate Informatics?

Tuesday, January 29th, 2013

What is Climate Informatics? by Steve

From the post:

I’ve been using the term Climate Informatics informally for a few years to capture the kind of research I do, at the intersection of computer science and climate science. So I was delighted to be asked to give a talk at the second annual workshop on Climate Informatics at NCAR, in Boulder this week. The workshop has been fascinating – an interesting mix of folks doing various kinds of analysis on (often huge) climate datasets, mixing up techniques from Machine Learning and Data Mining with the more traditional statistical techniques used by field researchers, and the physics-based simulations used in climate modeling.

I was curious to see how this growing community defines itself – i.e. what does the term “climate informatics” really mean? Several of the speakers offered definitions, largely drawing on the idea of the Fourth Paradigm, a term coined by Jim Gray, who explained it as follows. Originally, science was purely empirical. In the last few centuries, theoretical science came along, using models and generalizations, and in the latter half of the twentieth century, computational simulations. Now, with the advent of big data, we can see a fourth scientific research paradigm emerging, sometimes called eScience, focussed on extracting new insights from vast collections of data. By this view, climate informatics could be defined as data-driven inquiry, and hence offers a complement to existing approaches to climate science.

However, there’s still some confusion, in part because the term is new, and crosses disciplinary boundaries. For example, some people expected that Climate Informatics would encompass the problems of managing and storing big data (e.g. the 3 petabytes generated by the CMIP5 project, or the exabytes of observational data that is now taxing the resources of climate data archivists). However, that’s not what this community does. So, I came up with my own attempt to define the term:

Fleshes out a term that gets tossed around without a lot of discussion.

Personally I have never understood the attraction of disciplinary boundaries. Other than as an “in” versus “out” crowd for journal/presentation acceptance.

Given the low citation rates in the humanities, being “in” a discipline, to say nothing of peer review, isn’t a guarantee of good work.

The Data Science Toolkit is now on Vagrant!

Tuesday, January 29th, 2013

The Data Science Toolkit is now on Vagrant! by Pete Warden.

From the post:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer into a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate with other people without worrying that anything will be overwritten.

Before I discovered Vagrant, I’d attempted to do something similar with my Data Science Toolkit package, distributing a VMware image of a full linux system with all the software and data it required pre-installed. It was a large download, and a lot of people used it, but the setup took more work than I liked. Vagrant solved a lot of the usability problems around downloading VMs, so I’ve been eager to create a compatible version of the DSTK image. I finally had a chance to get that working over the weekend, so you can create your own local geocoding server just by running:

vagrant box add dstk

vagrant init

The box itself is almost 5GB with all the address data, so the download may take a while. Once it’s done go to http://localhost:8080 and you’ll see the web interface to the geocoding and unstructured data parsing functions.

Based on Oracle’s VirtualBox, this looks like a very cool way to distribute topic map applications with data.

Remember the Emulate Drug Dealers [Marketing Topic Maps] post?

I was very serious.

7 Incredible Web Design, Branding, Digital Marketing Experiences [Abandonment > 65%]

Tuesday, January 29th, 2013

7 Incredible Web Design, Branding, Digital Marketing Experiences by Avinash Kaushik.

From the post:

We are surrounded by incredible digital experiences. Masterful design, branding and marketing.

Yet, it would be fair to say we are also drowning in awful digital experiences – or, at the very minimum, experiences that seem to be stuck in 1991.

As a Digital Marketing Evangelist you can imagine how much that pains me.

When I work with companies, I do my very best to bring my deep and undying passion for creativity and digital awesomeness to them. One manifestation of that is the stories I tell by comparing and contrasting the client’s digital existence with others I consider best of breed.

In this blog post I want to try and do something similar by sharing some of my favorite digital experiences with you. There are 7 in total.

Each example is truly amazing and for each I’ll share my perspective on why. In each case there are also tips that highlight things that overtly or covertly make the company delightful.

What can you expect?

Inspiring landing pages, cool calls to action, delightful cart and checkout experiences, website copy delicious enough to eat, copy that convinces people to buy by respecting their intelligence, ecommerce reimagined, higher conversions via greater transparency, and examples of how to truly live your brand’s values online through an experience that leaves your customers happy and willing to pay more for your products!

I’m not promising anything this good on the renovation of but this is one source I am looking to for inspiration.

Sites you would like to suggest as incredible (in the good sense)? (The incredible in the bad sense are easy enough to find.)

BTW, I was very impressed by the “…cart abandonment rates routinely runs north of 65%…” line.

That’s higher than the U.S. divorce rate! 😉

An awesome post on web design!

Which of those lessons will you be applying to your website or topic map interface design?

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations

Tuesday, January 29th, 2013

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)


The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.

The authors illustrate a “skyline” query with a search for a hotel that has a good price and is close to the beach. A “skyline” set of hotels excludes every hotel that some other hotel matches or beats on both points and beats on at least one. They then observe:

In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.

The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.
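A minimal Python sketch of the flight example in the quote. The flight records, stopover codes, and function names are made up for illustration. This is the brute-force join-then-filter baseline that the abstract calls impractical at scale, not one of the paper's three algorithms. It aggregates cost and journey time by sum and leaves out the per-relation attributes (ratings, amenities).

```python
from itertools import product


def dominates(a, b):
    """True if a is no worse than b on every attribute and strictly
    better on at least one. Lower values are preferred throughout."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(records, key):
    """Naive O(n^2) skyline: keep records that no other record dominates."""
    return [r for r in records
            if not any(dominates(key(o), key(r)) for o in records)]


def aggregate_skyline_join(left, right):
    """Join the two relations on the stopover city, sum cost and hours,
    then take the skyline over the aggregated attributes."""
    joined = [
        (l[0], r[0], l[2] + r[2], l[3] + r[3])
        for l, r in product(left, right)
        if l[1] == r[1]
    ]
    return skyline(joined, key=lambda plan: (plan[2], plan[3]))


# Hypothetical relations: (flight id, stopover city, cost, hours)
from_a = [("f1", "X", 200, 3.0), ("f2", "X", 150, 5.0), ("f3", "Y", 300, 2.0)]
to_b = [("g1", "X", 100, 2.0), ("g2", "Y", 120, 1.5), ("g3", "Y", 400, 2.0)]

plans = aggregate_skyline_join(from_a, to_b)
for first_leg, second_leg, cost, hours in plans:
    print(first_leg, second_leg, cost, hours)
```

With this data the plan f3 + g3 (cost 700, 4.0 hours) drops out because f3 + g2 (cost 420, 3.5 hours) is both cheaper and faster. The other three plans each win on some trade-off between cost and time, so they all stay in the skyline.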

No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If it has, that happened after I stopped traveling several years ago.

Even simple tasks such as coordinating air and train schedules were unnecessarily difficult.

I suspect that is still the case, so I mention “skyline” queries as a topic to be aware of and, if necessary, to include in a topic map application that brings sanity to travel arrangements.

True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.

Bookmarks/Notes in PDF

Tuesday, January 29th, 2013

One of the advantages of the original topic maps standard, being based on HyTime, was its ability to point into documents. That is, the structure of a document could be treated as an anchor for linking into the document.

Sadly I am not writing to announce the availability of a HyTime utility for pointing into PDF.

I am writing to list links to resources for creating bookmarks/notes in PDF.

Not the next best thing but a pale substitute until something better comes along.

Open Source:

JPdfBookmarks: Pdf bookmarks editor: Active project with excellent documentation (including on bookmarks themselves). GPLv3 license.

Ahem, commercial options:



Nitro Reader

Others that I have overlooked?

Pointing into PDF is an important issue because scanning/reading the same introductory materials on graphs in dozens of papers is tiresome.

A link directly to the material of interest would save time and quite possibly serve as an extraction point for collating the important bits from several papers together.

Think of it as automated note taking with the advantage of not forgetting to write down the proper citation information.

Importing CSV Data into Neo4j

Tuesday, January 29th, 2013

neo4j-table-data: a Python utility for importing CSV data into a Neo4j database.

A Formalism for Graph Databases and its Model of Computation

Tuesday, January 29th, 2013

A Formalism for Graph Databases and its Model of Computation by Tony Tan and Juan Reutter.


Graph databases are directed graphs in which the edges are labeled with symbols from a finite alphabet. In this paper we introduce a logic for such graphs in which the domain is the set of edges. We compare its expressiveness with the standard logic in which the domain is the set of vertices. Furthermore, we introduce a robust model of computation for such logic, the so called graph pebble automata.

The abstract doesn’t really do justice to the importance of this paper for graph analysis. From the paper:

For querying graph structured data, one normally wishes to specify certain types of paths between nodes. Most common examples of these queries are conjunctive regular path queries [1, 14, 6, 3]. Those querying formalisms have been thoroughly studied, and their algorithmic properties are more or less understood. On the other hand, there has been much less work devoted on other formalisms other than graph reachability patterns, say, for example, the integrity constraints such as labels with unique names, typing constraints on nodes, functional dependencies, domain and range of properties. See, for instance, the survey [2] for more examples of integrity constraints.

The survey referenced in that quote is: Renzo Angles and Claudio Gutierrez. 2008. Survey of graph database models. ACM Comput. Surv. 40, 1, Article 1 (February 2008), 39 pages. DOI=10.1145/1322432.1322433

The abstract for the survey reads:

Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors. These models took off in the eighties and early nineties alongside object-oriented models. Their influence gradually died out with the emergence of other database models, in particular geographical, spatial, semistructured, and XML. Recently, the need to manage information with graph-like nature has reestablished the relevance of this area. The main objective of this survey is to present the work that has been conducted in the area of graph database modeling, concentrating on data structures, query languages, and integrity constraints.

Recommended if you want to build upon what is already known and well-established about graph databases.
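To make the “certain types of paths between nodes” in the paper's introduction concrete, here is a minimal Python sketch of a single regular path query over an edge-labeled directed graph. The graph, the labels, and the function name are made up for illustration. The pattern is supplied as a hand-written deterministic automaton rather than parsed from a regular expression, and the sketch covers one path query, not the conjunctive queries or the edge-based logic of the paper.

```python
from collections import deque


def regular_path_query(edges, transitions, start_state, accepting, source):
    """Return the nodes reachable from `source` along a path whose label
    sequence is accepted by the automaton. Explores (node, state) pairs
    breadth-first, i.e. the product of the graph and the automaton."""
    outgoing = {}
    for u, label, v in edges:
        outgoing.setdefault(u, []).append((label, v))

    seen = {(source, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, target in outgoing.get(node, []):
            next_state = transitions.get((state, label))
            if next_state is not None and (target, next_state) not in seen:
                seen.add((target, next_state))
                queue.append((target, next_state))
    return answers


# Hypothetical edge-labeled graph: (source node, label, target node)
edges = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "acme"),
    ("bob", "worksAt", "initech"),
    ("alice", "worksAt", "globex"),
]

# Automaton for the pattern: knows+ worksAt
transitions = {(0, "knows"): 1, (1, "knows"): 1, (1, "worksAt"): 2}

employers = regular_path_query(edges, transitions, 0, {2}, "alice")
print(sorted(employers))
```

The query asks where the people Alice knows, directly or through a chain of acquaintances, work. Alice's own employer is excluded because the pattern requires at least one “knows” edge first.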

Kwong – … Word Sense Disambiguation

Tuesday, January 29th, 2013

New Perspectives on Computational and Cognitive Strategies for Word Sense Disambiguation by Oi Yee Kwong.

From the description:

Cognitive and Computational Strategies for Word Sense Disambiguation examines cognitive strategies by humans and computational strategies by machines, for WSD in parallel.

Focusing on a psychologically valid property of words and senses, author Oi Yee Kwong discusses their concreteness or abstractness and draws on psycholinguistic data to examine the extent to which existing lexical resources resemble the mental lexicon as far as the concreteness distinction is concerned. The text also investigates the contribution of different knowledge sources to WSD in relation to this very intrinsic nature of words and senses.

I wasn’t aware that the “mental lexicon” of words had been fully described.

Shows what you can learn from reading marketing summaries of research.

CamelOne 2012 (videos/presentations) Boston, MA

Tuesday, January 29th, 2013

CamelOne 2012 (videos/presentations) Boston, MA

Videos and presentations for your enjoyment from the CamelOne 2012 conference.

As usual I was looking for something else and found more than I bargained for! 😉

Emulate Drug Dealers [Marketing Topic Maps]

Monday, January 28th, 2013

The title, “Emulate Drug Dealers,” is a chapter title in Rework by Jason Fried and David Heinemeier Hansson, 2010.

The chapter reads in part:

Drug Dealers get it right.

Drug dealers are astute businesspeople. They know their product is so good they’re willing to give a little away for free upfront. They know you’ll be back for more — with money.

Emulate drug dealers. Make your product so good, so addictive, so “can’t miss” that giving customers a small, free taste makes them come back with cash in hand.

This will force you to make something about your product bite-size. You want an easily digestible introduction to what you sell. This gives people a way to try it without investing any money or a lot of time.

Open source software and the like for topic maps don’t meet the test for “free” as described by Fried and Hansson.

Note the last sentence in the quote:

This gives people a way to try it without investing any money or a lot of time.

If I have to install the software, reconfigure my Java classpath, read a tutorial, plus some other documentation, then learn an editor, well, hell, I’ve lost interest already.

Once I am “sold” on topic maps, most of that won’t be a problem. But I have to be “sold” first.

I suspect that Fried and Hansson are serious about “bite-sized.” Doesn’t have to be an example of reliable merging of all extant linked data. 😉

Could be something far smaller, but clever and unexpected. Something that would catch the average user’s imagination.

If I knew of an example of “bite-sized” I would have started this post with it or have included it by now.

I don’t.

I am trying to think of one and wanted to ask for your help in finding one or more.



Monday, January 28th, 2013

PoSSuM : Pocket Similarity Searching using Multi-Sketches

From the webpage:

Today, vast amounts of protein-small molecule binding sites can be found in the Protein Data Bank (PDB). Exhaustive comparison of them is computationally demanding, but useful in the prediction of protein functions and drug discovery. We proposed a tremendously fast algorithm called “SketchSort” that enables the enumeration of similar pairs in a huge number of protein-ligand binding sites. We conducted all-pair similarity searches for 3.4 million known and potential binding sites using the proposed method and discovered over 24 million similar pairs of binding sites. We present the results as a relational database Pocket Similarity Search using Multiple-Sketches (PoSSuM), which includes all the discovered pairs with annotations of various types (e.g., CATH, SCOP, EC number, Gene ontology). PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures. Basically, the users can search similar binding pockets using two search modes:

i) “Search K” is useful for finding similar binding sites for a known ligand-binding site. Post a known ligand-binding site (a pair of “PDB ID” and “HET code”) in the PDB, and PoSSuM will search similar sites for the query site.

ii) “Search P” is useful for predicting ligands that potentially bind to a structure of interest. Post a known protein structure (PDB ID) in the PDB, and PoSSuM will search similar known-ligand binding sites for the query structure.

Obviously useful for the bioinformatics crowd but relevant for topic maps as well.

In topic map terminology, the searches are for associations with a known role player in a particular role, leaving the other role player unspecified.

It does not define or seek an exact match but provides the user with data that may help them make a match determination.
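A minimal Python sketch of that reading, with associations held as plain dictionaries. The association type, role names, site identifiers, and scores are all hypothetical and are not taken from PoSSuM or from any topic map API. The query fixes the association type and one role player, then returns whoever plays the other role, along with a score the user can weigh.

```python
def find_other_players(associations, assoc_type, known_role, known_player, wanted_role):
    """Return (player, score) pairs for the unspecified role in every
    association of the given type where `known_player` plays `known_role`."""
    return [
        (a["roles"][wanted_role], a.get("score"))
        for a in associations
        if a["type"] == assoc_type
        and a["roles"].get(known_role) == known_player
        and wanted_role in a["roles"]
    ]


# Hypothetical associations; "siteA" etc. are placeholders, not PDB entries.
associations = [
    {"type": "similar-binding-site",
     "roles": {"query-site": "siteA", "similar-site": "siteB"}, "score": 0.91},
    {"type": "similar-binding-site",
     "roles": {"query-site": "siteA", "similar-site": "siteC"}, "score": 0.78},
    {"type": "similar-binding-site",
     "roles": {"query-site": "siteD", "similar-site": "siteA"}, "score": 0.85},
    {"type": "binds-ligand",
     "roles": {"site": "siteA", "ligand": "ligandX"}},
]

candidates = find_other_players(
    associations, "similar-binding-site", "query-site", "siteA", "similar-site"
)
print(candidates)
```

The function returns candidates with their scores and makes no merge decision itself, which mirrors the point above: the data supports a match determination but does not make one.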

Hoopl (Dataflow and Haskell)

Monday, January 28th, 2013

Hoopl: Dataflow Optimization Made Simple by Norman Ramsey, João Dias, and Simon Peyton Jones.


We present Hoopl, a Haskell library that makes it easy for compiler writers to implement program transformations based on dataflow analyses. The compiler writer must identify (a) logical assertions on which the transformation will be based; (b) a representation of such assertions, which should form a lattice of finite height; (c) transfer functions that approximate weakest preconditions or strongest postconditions over the assertions; and (d) rewrite functions whose soundness is justified by the assertions. Hoopl uses the algorithm of Lerner, Grove, and Chambers (2002), which can compose very simple analyses and transformations in a way that achieves the same precision as complex, handwritten “super-analyses.” Hoopl will be the workhorse of a new back end for the Glasgow Haskell Compiler (version 6.12, forthcoming).

Superseded by later work but recommended by the authors as a fuller introduction to Hoopl.

The superseding work, Hoopl: A Modular, Reusable Library for Dataflow Analysis and Transformation, has the following abstract:

Dataflow analysis and transformation of control-flow graphs is pervasive in optimizing compilers, but it is typically tightly interwoven with the details of a particular compiler. We describe Hoopl, a reusable Haskell library that makes it unusually easy to define new analyses and transformations for any compiler. Hoopl’s interface is modular and polymorphic, and it offers unusually strong static guarantees. The implementation encapsulates state-of-the-art algorithms (interleaved analysis and rewriting, dynamic error isolation), and it cleanly separates their tricky elements so that they can be understood independently.

I started with the later work. Better read in the order presented here.
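For readers who do not write Haskell, the ingredients listed in the first abstract (a lattice of finite height, transfer functions, and rewrites justified by the facts) can be sketched in Python as a toy constant-propagation pass. Everything here is an assumption made for illustration: the statement encoding, the block names, and the function names are invented, and none of it is Hoopl's API. The sketch also runs the analysis to a fixed point and rewrites afterwards, where Hoopl interleaves the two.

```python
TOP = "TOP"  # lattice top: the variable is not a known constant


def join(f1, f2):
    """Join two fact maps. A variable absent from a map carries no
    information yet (lattice bottom), so the other side wins."""
    out = dict(f1)
    for var, val in f2.items():
        out[var] = val if var not in out or out[var] == val else TOP
    return out


def transfer(stmt, facts):
    """Forward transfer function for ("const", x, n) and ("add", x, y, z)."""
    facts = dict(facts)
    if stmt[0] == "const":
        _, target, value = stmt
        facts[target] = value
    else:
        _, target, left, right = stmt
        a, b = facts.get(left), facts.get(right)
        if a is None or b is None:
            facts.pop(target, None)  # operand still at bottom
        elif a == TOP or b == TOP:
            facts[target] = TOP
        else:
            facts[target] = a + b
    return facts


def analyze(blocks, successors, entry):
    """Worklist iteration to a fixed point; returns the facts entering each block."""
    in_facts = {name: {} for name in blocks}
    reached, worklist = {entry}, [entry]
    while worklist:
        name = worklist.pop()
        out = in_facts[name]
        for stmt in blocks[name]:
            out = transfer(stmt, out)
        for succ in successors.get(name, []):
            merged = join(in_facts[succ], out)
            if merged != in_facts[succ] or succ not in reached:
                in_facts[succ] = merged
                reached.add(succ)
                worklist.append(succ)
    return in_facts


def rewrite(blocks, in_facts):
    """Replace an addition with a constant wherever the facts justify it."""
    result = {}
    for name, stmts in blocks.items():
        facts, new_stmts = in_facts[name], []
        for stmt in stmts:
            facts = transfer(stmt, facts)
            value = facts.get(stmt[1])
            folded = stmt[0] == "add" and isinstance(value, int)
            new_stmts.append(("const", stmt[1], value) if folded else stmt)
        result[name] = new_stmts
    return result


# Hypothetical diamond-shaped control-flow graph
blocks = {
    "entry": [("const", "x", 1), ("const", "y", 2)],
    "left": [("add", "z", "x", "y")],
    "right": [("const", "z", 3), ("const", "x", 7)],
    "exit": [("add", "w", "z", "y")],
}
successors = {"entry": ["left", "right"], "left": ["exit"], "right": ["exit"]}

facts_in = analyze(blocks, successors, "entry")
optimized = rewrite(blocks, facts_in)
print(facts_in["exit"])
print(optimized["exit"])
```

At the join point x arrives as 1 on one path and 7 on the other, so it rises to TOP, while y and z agree on both paths and stay constant. That is enough to fold the addition in the exit block. Termination rests on the finite height of the lattice: the facts entering a block only ever move upward, so they can change only a bounded number of times.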