Archive for the ‘Authoring Semantics’ Category

Web Page Structure, Without The Semantic Web

Saturday, May 30th, 2015

Could a Little Startup Called Diffbot Be the Next Google?

From the post:

Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.
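The pipeline described above can be sketched in miniature. The feature set, training rectangles, and labels below are all hypothetical stand-ins; Diffbot’s actual features and models are not public. A nearest-centroid classifier stands in for whatever they train on geometric data:

```python
# Sketch of the idea described above: label page elements from their
# rendered geometry alone, with no Semantic Web markup. All element
# features, training data, and labels here are hypothetical.
from math import sqrt

# Each training example: (x, y, width, height, font_size) -> label,
# as if humans had drawn rectangles on rendered pages.
TRAINING = [
    ((50, 40, 700, 60, 32), "headline"),
    ((60, 50, 680, 55, 30), "headline"),
    ((50, 120, 700, 800, 14), "body"),
    ((55, 130, 690, 750, 13), "body"),
    ((760, 40, 200, 600, 11), "sidebar"),
    ((770, 45, 190, 580, 12), "sidebar"),
]

def centroids(training):
    """Average the feature vectors for each label (nearest-centroid model)."""
    sums, counts = {}, {}
    for feats, label in training:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, f in enumerate(feats):
            acc[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(feats, model):
    """Assign the label whose centroid is closest in feature space."""
    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(model, key=lambda lab: dist(feats, model[lab]))

model = centroids(TRAINING)
print(classify((52, 45, 705, 58, 31), model))  # a big box at the top of the page
```

A real system would use far richer features (fonts, colors, edge maps) and a trained model rather than centroids, but the shape of the computation — geometry in, element labels out — is the same.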

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.

There are several themes here that are relevant to topic maps.

First, it is true that most data starts with some structure, styles if you will, before it is presented for user consumption. Imagine an authoring application that automatically, and unknown to its user, captures metadata that can then provide semantics for its data.

Second, the structure-recognition approach Diffbot is using is promising in the large but should be promising in the small as well. Local documents of a particular type are unlikely to show the variance of documents across the web, which means that with far less effort you can build recognition systems that enable more powerful searching of local document repositories.

Third, and perhaps most importantly, while the results may not be 100% accurate, the question for any such project should be: how much accuracy is required? If I am mining social commentary blogs, a 5% error rate on recognition of speakers might be acceptable, because for popular threads or speakers those errors are going to be quickly corrected. As for unpopular threads or authors no one follows, does that come under no harm/no foul?

Highly recommended for reading/emulation.

Interactive Entity Resolution in Relational Data… [NG Topic Map Authoring]

Wednesday, June 5th, 2013

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation by Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, Louis Licamele.


Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to underlying real-world entities, is an important data cleaning process required before accurate visualization or analysis of the data is possible. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during the entity resolution process; it is useful both for the algorithms which determine likely database references to be resolved and for visual analytic tools which support the entity resolution process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe effectively combines relational entity resolution algorithms with a novel network visualization that enables users to make use of an entity’s relational context for making resolution decisions. Since resolution decisions often are interdependent, D-Dupe facilitates understanding this complex process through animations which highlight combined inferences and a history mechanism which allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed the benefits of the relational context visualization on the performance of entity resolution tasks in relational data in terms of time as well as users’ confidence and satisfaction.
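The core idea, blending string similarity with an entity’s relational context, can be sketched as follows. The co-author graph, the weights, and the scoring function are illustrative assumptions, not D-Dupe’s actual algorithm:

```python
# A minimal sketch of the paper's central idea: when deciding whether two
# references denote the same real-world entity, use their relational context
# (here, shared co-authors) alongside string similarity. The graph and the
# scoring weights are hypothetical, not D-Dupe's actual method.
from difflib import SequenceMatcher

# reference -> set of co-author references (the relational data)
COAUTHORS = {
    "L. Getoor":   {"H. Kang", "B. Shneiderman"},
    "Lise Getoor": {"H. Kang", "M. Bilgic"},
    "L. Licamele": {"M. Bilgic"},
}

def name_sim(a, b):
    """String similarity of the two reference names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def context_sim(a, b, graph):
    """Jaccard overlap of the two references' neighbors in the graph."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

def duplicate_score(a, b, graph, w=0.5):
    """Blend string similarity with relational-context similarity."""
    return w * name_sim(a, b) + (1 - w) * context_sim(a, b, graph)

s1 = duplicate_score("L. Getoor", "Lise Getoor", COAUTHORS)
s2 = duplicate_score("L. Getoor", "L. Licamele", COAUTHORS)
print(s1 > s2)  # prints True
```

The visualization side of D-Dupe — showing the neighbor overlap so a human can make the call — is exactly the part a score like this cannot replace.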

Talk about a topic map authoring tool!

It even chains entity resolution decisions together!

Not to be greedy, but interactive data deduplication and integration in Hadoop would be a nice touch. 😉

Software: D-Dupe: A Novel Tool for Interactive Data Deduplication and Integration.

Autocomplete Search with Redis

Sunday, December 9th, 2012

Autocomplete Search with Redis

From the post:

When we launched GetGlue HD, we built a faster and more powerful search to help users find the titles they were looking for when they want to check-in to their favorite shows and movies as they typed into the search box. To accomplish that, we used the in-memory data structures of the Redis data store to build an autocomplete search index.

Search Goals

The results we wanted to autocomplete for are a little different than the usual result types. The Auto complete with Redis writeup by antirez explores using the lexicographical ordering behavior of sorted sets to autocomplete for names. This is a great approach for things like usernames, where the prefix typed by the user is also the prefix of the returned results: typing mar could return Mara, Marabel, and Marceline. The deal-breaking limitation is that it will not return Teenagers From Mars, which is what we want our autocomplete to be able to do when searching for things like show and movie titles. To do that, we decided to roll our own autocomplete engine to fit our requirements. (Updated the link to the “Auto complete with Redis” post.)
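The word-prefix scheme the post describes can be sketched without Redis at all; plain Python sets stand in for Redis keys here, and the titles are the post’s own examples:

```python
# Sketch of the approach described in the post: index every prefix of every
# word in a title, so typing "mar" surfaces "Teenagers From Mars" as well as
# names that merely start with "mar". A real deployment would keep these
# sets in Redis; in-memory Python sets stand in for them here.
from collections import defaultdict

index = defaultdict(set)  # prefix -> set of titles containing a word with it

def add_title(title):
    """Index the title under every prefix of every word in it."""
    for word in title.lower().split():
        for i in range(1, len(word) + 1):
            index[word[:i]].add(title)

def complete(prefix):
    """All titles with a word starting with the typed prefix."""
    return sorted(index.get(prefix.lower(), set()))

for t in ["Mara", "Marabel", "Marceline", "Teenagers From Mars"]:
    add_title(t)

print(complete("mar"))
# ['Mara', 'Marabel', 'Marceline', 'Teenagers From Mars']
```

The trade-off is index size: every word contributes one entry per character, which is why the sorted-set lexicographic trick is attractive when only leading prefixes matter.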

Rather like the idea of autocomplete being more than just string completion.

What if, while typing a name, “autocompletion” returned one or more choices for what it thinks you may be talking about? With additional properties/characteristics shown for each choice, you could disambiguate your usage by allowing your editor to tag the term.

Perhaps another way to ease the burden of authoring a topic map.

Do Presidential Debates Approach Semantic Zero?

Thursday, October 18th, 2012

ReConstitution recreates debates through transcripts and language processing by Nathan Yau.

From Nathan’s post:

Part data visualization, part experimental typography, ReConstitution 2012 is a live web app linked to the US Presidential Debates. During and after the three debates, language used by the candidates generates a live graphical map of the events. Algorithms track the psychological states of Romney and Obama and compare them to past candidates. The app allows the user to get beyond the punditry and discover the hidden meaning in the words chosen by the candidates.

The visualization does not answer the thorny experimental question: Do presidential debates approach semantic zero?

Well, maybe the technique will improve by the next presidential election.

In the meantime, it was an impressive display of real-time processing and analysis of text.

Imagine such an interface that was streaming text for you to choose subjects, associations between subjects, and the like.

Not trying to perfectly code any particular stretch of text but interacting with the flow of the text.

There are goals other than approaching semantic zero.

Oil Drop Semantics?

Sunday, January 15th, 2012

Interconnection of Communities of Practice: A Web Platform for Knowledge Management and some related material made me think of the French “oil drop” counter-insurgency strategy.

With one important difference.

In a counter-insurgency context, the oil drop strategy is being used to further the goals of counter-insurgency force. Whatever you think of those goals or the alleged benefits for the places covered by the oil drops, the fundamental benefit is to the counter-insurgency force.

In a semantic context, one that seeks to elicit the local semantics of a group, the goal is not the furtherance of an outside semantic but the exposition of a local semantic, to the benefit of the group covered by the oil drop. As the oil drop spreads, those semantics may be combined with other oil drop semantics, but that is a cost and effort borne by the larger community seeking that benefit.

There are several immediate advantages to this approach with semantics.

First, the discussion of semantics at every level is taking place with the users of those semantics. You can hardly get closer to a useful answer than being able to ask the users of a semantic what was meant or for examples of usage. I don’t have a formalism for it, but I would postulate that as the distance from users increases, the usefulness of the semantics captured from them decreases.

Ask the FBI about the Virtual Case Management project. They didn’t ask users, or at least not enough of them, and flushed lots of cash. Lesson: asking management, IT, etc., about the semantics of users is an utter waste of time. Really.

If you want to know the semantics of user group X, then ask group X. If you ask Y about X, you will get Y’s semantics about X. If that is what you want, fine, but if you want the semantics of group X, you have wasted your time and resources.

Second, asking the appropriate group of users for their semantics means that you can make explicit the ROI from making their semantics explicit. That is to say, if asked, the group will ask about semantics that are meaningful to them: semantics that solve some task or issue that they encounter. Those may or may not be the semantics that interest you, but recall the issue is the group’s semantics, not yours.

The reason for the ROI question at the appropriate group level is so that the project is justified both to the group being asked to make the effort as well as those who must approve the resources for such a project. Answering that question up front helps get buy-in from group members and makes them realize this isn’t busy work but will have a positive benefit for them.

Third, such a bottom-up approach, whether you are using topic maps, RDF, etc., means that only the semantics that are important to users and justified by some positive benefit are captured. Your semantics may not have the rigor of SUMO, for example, but they are a benefit to you. What other test would you apply?

When Gamers Innovate

Monday, November 7th, 2011

When Gamers Innovate

The problem (partially):

Typically, proteins have only one correct configuration. Trying to virtually simulate all of them to find the right one would require enormous computational resources and time.

On top of that there are factors concerning translational-regulation. As the protein chain is produced in a step-wise fashion on the ribosome, one end of a protein might start folding quicker and dictate how the opposite end should fold. Other factors to consider are chaperones (proteins which guide its misfolded partner into the right shape) and post-translation modifications (bits and pieces removed and/or added to the amino acids), which all make protein prediction even harder. That is why homology modelling or “machine learning” techniques tend to be more accurate. However, they all require similar proteins to be already analysed and cracked in the first place.

The solution:

Rather than locking another group of structural shamans in a basement to perform their biophysical black magic, the “Fold It” team created a game. It uses human brainpower, which is fuelled by high-octane logic and catalysed by giving it a competitive edge. Players challenge their three-dimensional problem-solving skills by trying to: 1) pack the protein 2) hide the hydrophobics and 3) clear the clashes.

Read the post or jump to the Foldit site.

Seems to me there are a lot of subject identity and relationship (association) issues that are a lot less complex than protein folding. Not that topic mappers should shy away from protein folding, but we should be more imaginative about our authoring interfaces. Yes?

A machine learning toolbox for musician-computer interaction

Tuesday, July 26th, 2011

A machine learning toolbox for musician computer interaction


This paper presents the SARC EyesWeb Catalog, (SEC), a machine learning toolbox that has been specifically developed for musician-computer interaction. The SEC features a large number of machine learning algorithms that can be used in real-time to recognise static postures, perform regression and classify multivariate temporal gestures. The algorithms within the toolbox have been designed to work with any N-dimensional signal and can be quickly trained with a small number of training examples. We also provide the motivation for the algorithms used for the recognition of musical gestures to achieve a low intra-personal generalisation error, as opposed to the inter-personal generalisation error that is more common in other areas of human-computer interaction.
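The SEC’s algorithms are not reproduced in the abstract, but a common baseline for the task it describes — classifying temporal gestures from a handful of training examples — is nearest-neighbor matching under dynamic time warping (DTW). The gestures and labels below are hypothetical, and kept 1-D for brevity; real signals would be N-dimensional:

```python
# Generic sketch of one standard technique for the task the abstract
# describes: classify a temporal gesture by finding the training template
# with the smallest dynamic-time-warping (DTW) distance. The templates and
# labels are made up; the SEC's own algorithms are not shown here.

def dtw(a, b):
    """Dynamic-time-warping distance between two 1-D sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def classify(gesture, templates):
    """Label of the training example closest to the gesture under DTW."""
    best_label, best_d = None, float("inf")
    for label, examples in templates.items():
        for ex in examples:
            d = dtw(gesture, ex)
            if d < best_d:
                best_label, best_d = label, d
    return best_label

# A "small number of training examples" per gesture class, as in the paper.
TEMPLATES = {
    "rise": [[0, 1, 2, 3, 4], [0, 2, 4]],
    "fall": [[4, 3, 2, 1, 0], [4, 2, 0]],
}

print(classify([0, 0, 1, 2, 3, 4, 4], TEMPLATES))  # prints rise
```

DTW’s tolerance for stretched or compressed timing is what makes a template approach workable with so few examples — two performances of the same gesture rarely share a tempo.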

Recorded at: 11th International Conference on New Interfaces for Musical Expression. 30 May – 1 June 2011, Oslo, Norway.

The paper: A machine learning toolbox for musician computer interaction

The software: SARC EyesWeb Catalog [SEC]

Although written in the context of musician-computer interaction, the techniques described here could just as easily be applied to exploration or authoring of a topic map. Or for that matter exploring a data stream that is being presented to a user.

Imagine that one hand gives “focus” to some particular piece of data and the other hand “overlays” a query onto that data that then displays a portion of a topic map with that data as the organizing subject. Based on that result the data can be simply dumped back into the data stream or “saved” for further review and analysis.

A Survey On Games For Knowledge Acquisition

Tuesday, July 5th, 2011

A Survey On Games For Knowledge Acquisition by Stefan Thaler, Katharina Siorpaes, Elena Simperl, and, Christian Hofer.


Many people dedicate their free time to playing games or following game-related activities. The Casual Games Market Report 2007 [3] names games with more than 300 million downloads. Moreover, the Casual Games Association reports more than 200 million casual gamers worldwide [4]. People play them for various reasons, such as to relax, to be entertained, for the need of competition and to be thrilled [9]. Additionally, they want to be challenged, mentally as well as skill-based. As mentioned earlier, there are tasks that are relatively easy for humans to complete but computationally rather infeasible to solve [27]. The idea of integrating such tasks as the goal of games has been created and realized in platforms such as OntoGame [21], GWAP [26] and others. Consequently, they have produced a win-win situation where people had fun playing games while actually doing something useful, namely producing output data which can be used to improve the experience when dealing with data. That is why we in this paper describe state-of-the-art games. Firstly, we briefly introduce games for knowledge acquisition. Then we outline various games for semantic content creation we found, grouped by the task they attempt to fulfill. We then provide an overview of these games based on various criteria in tabular form.

Interesting survey of the field that will hopefully be updated every year or even made into an online resource that can change as new games emerge.

Curious about two possibilities for semantic games:

1) Has anyone made a first-person shooter game based on recognition of facial images of politicians? Thinking that if you were given a set of “bad” guys to recognize for each level, you could shoot those plus the usual combatants. The images in the game would be drawn from news footage, etc. Thinking this might attract political devotees. I even have a good name for it: “Term Limits.”

2) On the theory that there is no one nosier than a neighbor, why not create an email tagging game where anonymous co-workers get to tag your email (both in and out)? That would be one way to add semantic value to corporate email and generate a lot of interest in doing so. Possible name: “Heard at Water Cooler.”

Addictive Semantic Apps.

Tuesday, July 5th, 2011

10 Ways to make your Semantic App. addictive – Revisited

People after my own heart! Let’s drop all the pretense! We want people to use our apps to the exclusion of other apps. We want people to give up sleep to use our apps! We want people to call in sick, forget to eat, forget to put out the cat…, sorry, got carried away. 😉

Seriously, creating apps that people “buy into” is critical for the success of any app and no less so for semantic apps.

The less colorful summary about the workshop says:

In many application scenarios useful semantic content can hardly be created (fully) automatically, but motivating people to become an active part of this endeavor is still an art more than a science. In this tutorial we will look into fundamental design issues of semantic-content authoring technology – and of the applications deploying such technology – in order to find out which incentives speak to people to become engaged with the Semantic Web, and to determine the ways these incentives can be transferred into technology design. We will present how methods and techniques from areas as diverse as participation management, usability engineering, mechanism design, social computing, and game mechanics can be jointly applied to analyze semantically enabled applications, and subsequently design incentives-compatible variants thereof. The discussion will be framed by three case studies on the topics of enterprise knowledge management, media and entertainment, and IT ecosystems, in which combinations of these methods and techniques has led to increased user participation in creating useful semantic descriptions of various types of digital resources – text documents, images, videos and Web services and APIs. Furthermore, we will revisit the best practices and guidelines that have been at the core of an earlier version of this tutorial at the previous edition of the ISWC in 2010, following the empirical findings and insights gained during the operation of the three case studies just mentioned. These guidelines provide IT developers with a baseline to create technology and end-user applications that are not just functional, but facilitate and encourage user participation that supports the further development of the Semantic Web.

Well, they can say: “…facilitate and encourage user participation…” but I’m in favor of addiction. 😉

BTW, notice the Revisited in the title?

You can see the slides from last year, 10 Ways to make your Semantic App. addictive, while you are waiting for this year’s workshop. (I am searching for videos but so far have come up empty. Maybe the organizers can film the presentations this year?)

Date: October 23 or 24, half day
Place: Bonn, Germany, Maritim Bonn


INSEMTIVES: Incentives for Semantics

Tuesday, July 5th, 2011

INSEMTIVES: Incentives for Semantics

From the about:

The objective of INSEMTIVES is to bridge the gap between human and computational intelligence in the current semantic content authoring R&D landscape. The project aims at producing methodologies, methods and tools that enable the massive creation and feasible management of semantic content in order to facilitate the world-wide uptake of semantic technologies.

You have to hunt for it (better navigation needed?) but there is a gaming kit for INSEMTIVES at SourceForge.

A mother lode of resources on methods for the creation of semantic content that aren’t boring. 😉