Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 1, 2011

Linked Data for Education and Technology-Enhanced Learning (TEL)

Filed under: Education,Linked Data,LOD — Patrick Durusau @ 8:27 pm

Linked Data for Education and Technology-Enhanced Learning (TEL)

From the website:

Interactive Learning Environments special issue on Linked Data for Education and Technology-Enhanced Learning (TEL)

IMPORTANT DATES
================

  • 30 November 2011: Paper submission deadline (11:59pm Hawaiian time)
  • 30 March 2012: Notification of first review round
  • 30 April 2012: Submission of major revisions
  • 15 July 2012: Notification of major revision reviews
  • 15 August 2012: Submission of minor revisions
  • 30 August 2012: Notification of acceptance
  • late 2012 : Publication

OVERVIEW
=========

While sharing of open learning and educational resources on the Web became common practice throughout the last years, a large amount of research was dedicated to interoperability between educational repositories based on semantic technologies. However, although the Semantic Web has seen large-scale success in its recent incarnation as a Web of Linked Data, there is still little adoption of the successful Linked Data principles in the domains of education and technology-enhanced learning (TEL). This special issue builds on the fundamental belief that the Linked Data approach has the potential to fulfill the TEL vision of Web-scale interoperability of educational resources as well as highly personalised and adaptive educational applications. The special issue solicits research contributions exploring the promises of the Web of Linked Data in TEL by gathering researchers from the areas of the Semantic Web and educational science and technology.

TOPICS OF INTEREST
=================

We welcome papers describing current trends in research on (a) how technology-enhanced learning approaches take advantage of Linked Data on the Web and (b) how Linked Data principles and semantic technologies are being applied in technology-enhanced learning contexts. Both application-oriented and theoretical papers are welcome. Relevant topics include but are not limited to the following:

  • Using Linked Data to support interoperability of educational resources
  • Linked Data for informal learning
  • Personalisation and context-awareness in TEL
  • Usability and advanced user interfaces in learning environments and Linked Data
  • Light-weight TEL metadata schemas
  • Exposing learning object metadata via RDF/SPARQL & service-oriented approaches
  • Semantic & syntactic mappings between educational metadata schemas and standards
  • Controlled vocabularies, ontologies and terminologies for TEL
  • Personal & mobile learning environments and Linked Data
  • Learning flows and designs and Linked Data
  • Linked Data in (visual) learning analytics and educational data mining
  • Linked Data in organizational learning and learning organizations
  • Linked Data for harmonizing individual learning goals and organizational objectives
  • Competency management and Linked Data
  • Collaborative learning and Linked Data
  • Linked Data-driven social networking and collaborative learning

September 30, 2011

DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents

Filed under: DBpedia,Linked Data,LOD — Patrick Durusau @ 7:07 pm

DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents by Pablo Mendes (email announcement)

We are happy to announce the release of DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents.

DBpedia Spotlight is a tool for annotating mentions of DBpedia entities and concepts in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. The DBpedia Spotlight architecture is composed of the following modules:

  • Web application, a demonstration client (HTML/Javascript UI) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.
  • Web Service, a RESTful Web API that exposes the functionality of annotating and/or disambiguating resources in text. The service returns XML, JSON or XHTML+RDFa.
  • Annotation Java / Scala API, exposing the underlying logic that performs the annotation/disambiguation.
  • Indexing Java / Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.

In this release we have provided many enhancements to the Web Service, installation process, as well as the spotting, candidate selection, disambiguation and annotation stages. More details on the enhancements are provided below.

The new version is deployed at:

Instructions on how to use the Web Service are available at: http://spotlight.dbpedia.org

We invite your comments on the new version before we deploy it on our production server. We will keep it on the “dev” server until October 6th, when we will finally make the switch to the production server at http://spotlight.dbpedia.org/demo/ and http://spotlight.dbpedia.org/rest/

If you are a user of DBpedia Spotlight, please join dbp-spotlight-users@lists.sourceforge.net for announcements and other discussions.

Warning: I think they are serious about the requirement of Firefox 6.0.2 and Chromium 12.0.

I tried it on an older version of Firefox on Ubuntu and got no results at all. Will upgrade Firefox but only in my directory.
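If you would rather skip the browser requirements entirely, the REST interface can be exercised from code. Here is a minimal sketch using Python's standard library; the /rest/annotate path, the text/confidence/support parameter names, and the "Resources"/"@URI" keys in the JSON are assumptions based on the Spotlight documentation of the time, so adjust them to whichever server (dev or production) you are testing against.

```python
# Hedged sketch: calling the DBpedia Spotlight annotation Web Service.
# Endpoint path, parameter names, and JSON keys are assumptions; adjust as needed.
import json
import urllib.parse
import urllib.request

def annotate(text, confidence=0.2, support=20,
             endpoint="http://spotlight.dbpedia.org/rest/annotate"):
    query = urllib.parse.urlencode(
        {"text": text, "confidence": confidence, "support": support})
    request = urllib.request.Request(
        endpoint + "?" + query,
        headers={"Accept": "application/json"})  # XML or XHTML+RDFa also possible
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Print the DBpedia resources Spotlight finds in a sentence.
result = annotate("Berlin is the capital of Germany.")
for resource in result.get("Resources", []):
    print(resource.get("@URI"), resource.get("@surfaceForm"))
```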

September 29, 2011

Beyond the Triple Count

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 6:38 pm

Beyond the Triple Count by Leigh Dodds.

From the post:

I’ve felt for a while now that the Linked Data community has an unhealthy fascination on triple counts, i.e. on the size of individual datasets.

This was quite natural in the boot-strapping phase of Linked Data in which we were primarily focused on communicating how much data was being gathered. But we’re now beyond that phase and need to start considering a more nuanced discussion around published data.

If you’re a triple store vendor then you definitely want to talk about the volume of data your store can hold. After all, potential users or customers are going to be very interested in how much data could be indexed in your product. Even so, no-one seriously takes a headline figure at face value. As users we’re much more interested in a variety of other factors. For example how long does it take to load my data? Or, how well does a store perform with my usage profile, taking into account my hardware investment? Etc. This is why we have benchmarks, so we can take into account additional factors and more easily compare stores across different environments.

But there’s not nearly enough attention paid to other factors when evaluating a dataset. A triple count alone tells us nothing. They’re not even a good indicator of the number of useful “facts” in a dataset.

Watch Leigh’s presentation (embedded with his post) and read the post.

I think his final paragraph sets the goal for a wide variety of approaches, however we might disagree about how to best get there! 😉

Very much worth your time to read and ponder.

September 27, 2011

Linked Data Semantic Issues (same for topic maps?)

Filed under: Linked Data,LOD,Marketing,Merging,Topic Maps — Patrick Durusau @ 6:51 pm

Sebastian Schaffert posted a message on the public-lod@w3.org list that raised several issues about Linked Data. Issues that sound relevant to topic maps. See what you think.

From the post:

We are working together with many IT companies (with excellent software developers) and trying to convince them that Semantic Web technologies are superior for information integration. They are already overwhelmed when they have to understand that a database ID for an object is not enough. If they have to start distinguishing between the data object and the real world entity the object might be representing, they will be lost completely.

I guess being told that a “real world entity” may have different ways to be identified must seem to be the road to perdition.

Curious, because the “real world” is a messy place. Or is that the problem? That the world of developers is artificially “clean,” at least as far as identification and reference are concerned.

Perhaps CS programs need to train developers for encounters with the messy “real world.”

From the post:

> When you dereference the URL for a person (such as …/561666514#), you get back RDF. Our _expectation_, of course, is that that RDF will include some remarks about that person (…/561666514#), but there can be no guarantee of this, and no guarantee that it won’t include more information than you asked for. All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful. You add this to your knowledge of the world, and move on.

There I have my main problem. If I ask for “A”, I am not really interested in “B”. What our client implementation therefore does is to throw away everything that is about B and only keeps data about A. Which is – in case of the FB data – nothing. The reason why we do this is that often you will get back a large amount of irrelevant (to us) data even if you only requested information about a specific resource. I am not interested in the 999 other resources the service might also want to offer information about, I am only interested in the data I asked for. Also, you need to have some kind of “handle” on how to start working with the data you get back, like:
1. I ask for information about A, and the server gives me back what it knows about A (there, my expectation again …)
2. From the data I get, I specifically ask for some common properties, like A foaf:name ?N and do something with the bindings of N. Now how would I know how to even formulate the query if I ask for A but get back B?
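In rdflib terms, the client behaviour Sebastian describes looks roughly like the sketch below (my illustration, not his code; the URI is hypothetical): dereference a resource, then keep only the triples whose subject is the resource that was asked for.

```python
# Illustrative sketch of the "keep only data about A" client behaviour:
# dereference a URI and discard every triple whose subject is not the
# resource that was asked for. The URI below is hypothetical.
from rdflib import Graph, URIRef

def describe_only(resource_uri):
    fetched = Graph()
    fetched.parse(resource_uri)  # dereference the URI, parse whatever RDF comes back

    subject = URIRef(resource_uri)
    kept = Graph()
    for s, p, o in fetched.triples((subject, None, None)):
        kept.add((s, p, o))  # everything "about B" is silently dropped
    return kept

about_a = describe_only("http://example.org/resource/561666514")
for _, predicate, value in about_a:
    print(predicate, value)
```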

Ouch! That one cuts a little close. 😉

What about the folks who are “…not really interested in ‘B’.” ?

How do topic maps serve their interests?

Or have we decided for them that more information about a subject is better?

Or is that a matter of topic map design? What information to include?

That “merging” and what gets “merged” is a user/client decision?

That is how it works in practice simply due to time, resources, and other constraints.

Marketing questions:

How to discover data users would like to have appear with other data, prior to having a contract to do so?

Can we re-purpose search logs for that?

September 9, 2011

LATC – Linked Open Data Around-the-Clock

Filed under: Government Data,Linked Data,LOD — Patrick Durusau @ 7:10 pm

LATC – Linked Open Data Around-the-Clock

This appears to be an early release of the site because it has an “unfinished” feel to it. For example, you have to poke around a bit to find the tools link. And it isn’t clear how the project intends to promote the use of those tools or originate others to promote the use of linked data.

I suppose it is too late to avoid the grandiose “around-the-clock” project name? Web servers, barring some technical issue, are up 24 x 7. They keep going even as we sleep. Promise.

Objectives:

increase the number, the quality and the accuracy of data links between LOD datasets. LATC contributes to the evolution of the World Wide Web into a global data space that can be exploited by applications similar to a local database today. By increasing the number and quality of data links, LATC makes it easier for European Commission-funded projects to use the Linked Data Web for research purposes.

support institutions as well as individuals with Linked Data publication and consumption. Many of the practical problems that a European Commission-funded project may discover when interacting with the Web of Data are solved on the conceptual level and the solutions have been implemented in freely available data publication and consumption tools. What is still missing is the dissemination of knowledge about how to use these tools to interact with the Web of Linked Data. We aim at providing this knowledge.

create an in-depth test-bed for data intensive applications by publishing datasets produced by the European Commission, the European Parliament, and other European institutions as Linked Data on the Web and by interlinking them with other governmental data, such as found in the UK and elsewhere.

September 8, 2011

Press.net News Ontologies & rNews

Filed under: Linked Data,LOD,Ontology — Patrick Durusau @ 5:58 pm

Press.net News Ontologies

From the webpage:

The news ontology is comprised of several ontologies, which describe assets (text, images, video) and the events and entities (people, places, organisations, abstract concepts etc.) that appear in news content. The asset model is the representation of news content as digital assets created by a news provider (e.g. text, images, video and data such as csv files). The domain model is the representation of the ‘real world’ which is the subject of news. There are simple entities, which we have labelled with the amorphous term of ‘stuff‘ and complex entities. Currently, the only complex entity the ontology is concerned with is events. The term stuff has been used to include abstract and intangible concepts (e.g. Feminism, Love, Hate Crime etc.) as well as tangible things (e.g. Lord Ashdown, Fiat Punto, Queens Park Rangers).

Assets (news content) are about things in the world (the domain model). The connection between assets and the entities that appear in them is made using tags. Assets are further holistically categorised using classification schemes (e.g. IPTC Media Topic Codes, Schema.org Vocabulary or Press Association Categorisations).

No sooner had I seen that on the LOD list than Stéphane Corlosquet pointed out rNews, another ontology for news.

From the rNews webpage:

rNews is a proposed standard for using RDFa to annotate news-specific metadata in HTML documents. The rNews proposal has been developed by the IPTC, a consortium of the world’s major news agencies, news publishers and news industry vendors. rNews is currently in draft form and the IPTC welcomes feedback on how to improve the standard in the rNews Forum.

I am sure there are others.

Although I rather like stuff as an alternative to SUMO’s thing or was that Cyc?

The point being that mapping strategies, when the expense can be justified, are the “answer” to the semantic diversity and richness of human discourse.

September 2, 2011

Improving the recall of decentralised linked data querying through implicit knowledge

Filed under: Linked Data,LOD,SPARQL — Patrick Durusau @ 8:02 pm

Improving the recall of decentralised linked data querying through implicit knowledge by Jürgen Umbrich, Aidan Hogan, and Axel Polleres.

Abstract:

Aside from crawling, indexing, and querying RDF data centrally, Linked Data principles allow for processing SPARQL queries on-the-fly by dereferencing URIs. Proposed link-traversal query approaches for Linked Data have the benefits of up-to-date results and decentralised (i.e., client-side) execution, but operate on incomplete knowledge available in dereferenced documents, thus affecting recall. In this paper, we investigate how implicit knowledge – specifically that found through owl:sameAs and RDFS reasoning – can improve the recall in this setting. We start with an empirical analysis of a large crawl featuring 4 million Linked Data sources and 1.1 billion quadruples: we (1) measure expected recall by only considering dereferenceable information, (2) measure the improvement in recall given by considering rdfs:seeAlso links as previous proposals did. We further propose and measure the impact of additionally considering (3) owl:sameAs links, and (4) applying lightweight RDFS reasoning (specifically ρDF) for finding more results, relying on static schema information. We evaluate our methods for live queries over our crawl.

From the document:

owl:sameAs links are used to expand the set of query relevant sources, and owl:sameAs rules are used to materialise implicit knowledge given by the OWL semantics, potentially generating additional answers.

I have always thought that knowing the “why” of an owl:sameAs assertion would make it more powerful. But since any basis for subject sameness can be used, that may not be the case.
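For readers who want to see the mechanics, here is a rough sketch (mine, not the authors' implementation) of the first step the paper describes: collapsing owl:sameAs links into alias sets so that a lookup for one URI can also be run against its aliases. The URIs are only examples.

```python
# Rough sketch: union-find over owl:sameAs statements, then expand a requested
# URI into all of its known aliases before querying further sources.
class SameAsIndex:
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_same_as(self, a, b):
        # owl:sameAs is symmetric and transitive, so merging the two roots is enough.
        self.parent[self._find(a)] = self._find(b)

    def aliases(self, uri, universe):
        root = self._find(uri)
        return {u for u in universe if self._find(u) == root}

# Example with hypothetical URIs.
idx = SameAsIndex()
idx.add_same_as("http://dbpedia.org/resource/Berlin", "http://example.org/id/berlin")
known_uris = ["http://dbpedia.org/resource/Berlin",
              "http://example.org/id/berlin",
              "http://example.org/id/paris"]
print(idx.aliases("http://dbpedia.org/resource/Berlin", known_uris))
```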

August 25, 2011

SERIMI…. (Have you washed your data?)

Filed under: Linked Data,LOD,RDF,Similarity — Patrick Durusau @ 7:04 pm

SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking

From the website:

The interlinking of datasets published in the Linked Data Cloud is a challenging problem and a key factor for the success of the Semantic Web. Manual rule-based methods are the most effective solution for the problem, but they require skilled human data publishers going through a laborious, error prone and time-consuming process for manually describing rules mapping instances between two datasets. Thus, an automatic approach for solving this problem is more than welcome. We propose a novel interlinking method, SERIMI, for solving this problem automatically. SERIMI matches instances between a source and a target dataset, without prior knowledge of the data, domain or schema of these datasets. Experiments conducted with benchmark collections demonstrate that our approach considerably outperforms state-of-the-art automatic approaches for solving the interlinking problem on the Linked Data Cloud.

SERIMI-TECH-REPORT-v2.pdf

From the Results section:

The poor performance of SERIMI in the Restaurant1-Restaurant2 pair is mainly due to missing alignments in the reference set. The poor performance in the Person21-Person22 pair is due to the nature of the data. These datasets were built by adding spelling mistakes to the properties and literal values of their original datasets. Also, only instances of class Person were retrieved into the pseudo-homonym sets during the interlinking process.

Impressive work overall but isn’t dirty data really the test? Just about any process can succeed with clean data.

Or is that really the weakness of the Semantic Web? That it requires clean data?

August 10, 2011

LOD cloud diagram – Next Version

Filed under: Linked Data,LOD,Semantic Web — Patrick Durusau @ 7:17 pm

Anja Jentsch posted the following call on the public-lod@w3.org list:

we would like to thank you for putting so much effort in curating the CKAN packages for Linked Data sets since our last call.

We have compiled statistics for the 256 data sets[1] on CKAN that will be included in the next LOD Cloud: http://lod-cloud.net/state

Altogether 446 data sets are currently tagged on CKAN as LOD [2]. But the description of many of these data sets is still incomplete so that we cannot find out whether they fulfil the minimal requirements for being included into the LOD cloud diagram (dereferenceable URIs and RDF links to or from other data sources).

A list of data sets that we could not include yet and an explanation of what is missing can be found here: http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/

Starting next week we will generate the next LOD cloud diagram [3].

Therefore we would like to invite those of you who publish data sets that we could not include yet to please review and update your entries. Please finalize your dataset descriptions by August 15th to ensure that your data set will be part of the LOD Cloud.

In order to aid you in this quest, we have provided a validation page for your CKAN entry with step-by-step guidance for the information needed:
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/

You can use the CKAN entry for DBpedia as an example:
http://ckan.net/package/dbpedia

Thank you for helping!

Cheers,
Anja, Chris and Richard

[1] http://ckan.net/package/search?q=groups:lodcloud+AND+-tags:lodcloud.unconnected+AND+-tags:lodcloud.needsfixing
[2] http://ckan.net/tag/lod
[3] http://lod-cloud.net/

Just a reminder, today is the 10th of August so don’t wait to review your entry.

Whatever your approach, we all benefit from cleaner data.

July 6, 2011

A Survey On Data Interlinking Methods

Filed under: Linked Data,LOD,RDF — Patrick Durusau @ 2:11 pm

A Survey On Data Interlinking Methods by Stephan Wölger, Katharina Siorpaes, Tobias Bürger, Elena Simperl, Stefan Thaler, and Christian Hofer.

From the introduction:

In 2007 the Linking Open Data (LOD) community project started an initiative which aims at increased use of Semantic Web applications. Such applications on the one hand provide new means to enrich a user’s web experience but on the other hand also require certain standards to be adhered to. Two important requirements when it comes to Semantic Web applications are the availability of RDF datasets on the web and having typed links between these datasets in order to be able to browse the data and to jump between them in various directions.

While there exist tools that create RDF output automatically from the application level and tools that create RDF from web sites, interlinking the resulting datasets is still a task that can be cumbersome for humans (either because there is a lack of incentives or due to the non-availability of user friendly tools) or not doable for machines (due to the manifoldness of domains). Despite the fact that there are more and more interlinking tools available, those either can be applied only for certain domains of the real world (e.g. publications) or they can be used just for interlinking a specific type of data (e.g. multimedia data).

Another interesting survey article from the Semantic Technology Institute (STI) Innsbruck, University of Innsbruck.

I like the phrase “…manifoldness of domains.” RDF output is useful information about data. The problem I foresee is that the semantics it represents are local, hence the “manifoldness of domains.” Not always: there are some domains so close as to be indistinguishable from one another, and there linking RDF will work quite well.

One imagines that RDF-based interlinking of Office Depot, Staples and OfficeMax should not be difficult. Tiresome, not terribly interesting, but not difficult. And that could prove to be useful for personal and corporate buyers seeking price breaks or competitors trying to decide on loss leaders. Not a lot of reasoning to be done except by the buyers and sellers.

I am sure there would still be some domain differences between those vendors but having a common mapping from one vendor number to all three vendor numbers could prove to be very useful for customers and distributors alike.

For more complex/abstract domains, where the “manifoldness of domains” is an issue, you can use topic maps.

June 29, 2011

Providing and discovering definitions of URIs

Filed under: Identifiers,Linked Data,LOD,OWL,RDF,Semantic Web — Patrick Durusau @ 9:10 am

Providing and discovering definitions of URIs by Jonathan A. Rees.

Abstract:

The specification governing Uniform Resource Identifiers (URIs) [rfc3986] allows URIs to mean anything at all, and this unbounded flexibility is exploited in a variety of contexts, notably the Semantic Web and Linked Data. To use a URI to mean something, an agent (a) selects a URI, (b) provides a definition of the URI in a manner that permits discovery by agents who encounter the URI, and (c) uses the URI. Subsequently other agents may not only understand the URI (by discovering and consulting the definition) but may also use the URI themselves.

A few widely known methods are in use to help agents provide and discover URI definitions, including RDF fragment identifier resolution and the HTTP 303 redirect. Difficulties in using these methods have led to a search for new methods that are easier to deploy, and perform better, than the established ones. However, some of the proposed methods introduce new problems, such as incompatible changes to the way metadata is written. This report brings together in one place information on current and proposed practices, with analysis of benefits and shortcomings of each.

The purpose of this report is not to make recommendations but rather to initiate a discussion that might lead to consensus on the use of current and/or new methods.
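The two established discovery methods the abstract mentions are easy to observe from code. Here is a small sketch (my illustration, not part of the report): it says whether a URI relies on a fragment identifier or answers a plain GET with an HTTP 303 redirect. The DBpedia URI is just a convenient example.

```python
# Hedged sketch: probing how a URI "provides" its definition via the two widely
# known methods discussed in the report (hash URIs and HTTP 303 redirects).
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # stop urllib from following the redirect so we can inspect it

def definition_hint(uri):
    if "#" in uri:
        return "hash URI: definition expected in the document before the '#'"
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(urllib.request.Request(uri))
    except urllib.error.HTTPError as e:
        if e.code == 303:
            return "303 See Other -> " + (e.headers.get("Location") or "?")
        return "HTTP %d (no 303 redirect seen)" % e.code
    return "200 OK: the URI serves a document directly"

print(definition_hint("http://dbpedia.org/resource/Berlin"))
```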

The criteria for success:

  1. Simple. Having too many options or too many things to remember makes discovery fragile and impedes uptake.
  2. Easy to deploy on Web hosting services. Uptake of linked data depends on the technology being accessible to as many Web publishers as possible, so should not require control over Web server behavior that is not provided by typical hosting services.
  3. Easy to deploy using existing Web client stacks. Discovery should employ a widely deployed network protocol in order to avoid the need to deploy new protocol stacks.
  4. Efficient. Accessing a definition should require at most one network round trip, and definitions should be cacheable.
  5. Browser-friendly. It should be possible to configure a URI that has a discoverable definition so that ‘browsing’ to it yields information useful to a human.
  6. Compatible with Web architecture. A URI should have a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.


I had to look it up to get the page number but I remembered Karl Wiegers in Software Requirements saying:

Feasible

It must be possible to implement each requirement within the known capabilities and limitations of the system and its environment.

The “single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name” requirement is not feasible. It will stymie this project, despite the array of talent on hand, until it is no longer a requirement.

Need proof? Name one URI with a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.

Not one that the W3C TAG, or TBL or anyone else thinks/wants/prays has a single agreed meaning globally, … but one that in fact has such a global meaning.

It’s been more than ten years. Let’s drop the last requirement and let the rather talented group working on this come up with a solution that meets the other five (5) requirements.

It won’t be a universal solution but then neither is the WWW.

LDIF – Linked Data Integration Framework

Filed under: LDIF,Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 9:02 am

LDIF – Linked Data Integration Framework 0.1

From the webpage:

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain datasets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up till now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an initial alpha version of an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI aliases.

More comments will follow, but…

Isn’t this the reverse of the well-known synonym table in IR?

Instead of substituting synonyms in the query expression, the underlying data is being transformed to produce…a lack of synonyms?

No, not quite the reverse of a synonym table. In synonym-table terms, we would lose the synonym table and transform the underlying textual data to use only a single term where before there were N terms, all of which occurred in the synonym table.

If I search for a term previously listed in the synonym table, but one replaced by a common term, my search result will be empty.

No more synonyms? That sounds like a bad plan to me.
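A toy sketch of the difference, with a made-up vocabulary: query-time expansion keeps every surface term searchable, while rewriting the data to a canonical term (the normalization LDIF describes) makes searches for the replaced terms come back empty.

```python
# Toy illustration (hypothetical vocabulary) of the point above.
SYNONYMS = {"auto": "car", "automobile": "car"}  # synonym table: variant -> canonical

documents = ["the automobile is parked", "a red car", "my auto broke down"]

def search_with_expansion(term, docs):
    # Query-time expansion: search for the term, its canonical form, and all variants.
    canonical = SYNONYMS.get(term, term)
    variants = {term, canonical} | {v for v, c in SYNONYMS.items() if c == canonical}
    return [d for d in docs if any(v in d.split() for v in variants)]

def search_after_rewrite(term, docs):
    # Data-side normalization: every variant in the documents is replaced by its canonical form.
    rewritten = [" ".join(SYNONYMS.get(w, w) for w in d.split()) for d in docs]
    return [d for d in rewritten if term in d.split()]

print(search_with_expansion("automobile", documents))  # finds all three documents
print(search_after_rewrite("automobile", documents))   # finds nothing: the term was rewritten away
```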

June 2, 2011

Second International Workshop on Consuming Linked Data (COLD2011)

Filed under: Linked Data,LOD — Patrick Durusau @ 7:42 pm

Second International Workshop on Consuming Linked Data (COLD2011)

Important Dates:

Paper submission deadline: August 15, 2011, 23.59 Hawaii time
Acceptance notification: September 6, 2011
Camera-ready versions of accepted papers: September 15, 2011
Workshop date: October 23 or 24, 2011

From the website:

Abstract:

The quantity of published Linked Data is increasing dramatically. However, applications that consume Linked Data are not yet widespread. Current approaches lack methods for seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces. Addressing these issues requires well-founded research, including the development and investigation of concepts that can be applied in systems which consume Linked Data from the Web. Following the success of the 1st International Workshop on Consuming Linked Data, we organize the second edition of this workshop in order to provide a platform for discussion and work on these open research problems. The main objective is to provide a venue for scientific discourse -including systematic analysis and rigorous evaluation- of concepts, algorithms and approaches for consuming Linked Data.

Err “…lack methods for seamless integration of Linked Data from multiple sources…” has topic maps written all over it.

June 1, 2011

Silk – A Link Discovery Framework for the Web of Data

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 6:52 pm

Silk – A Link Discovery Framework for the Web of Data

From the website:

The Web of Data is built upon two simple ideas: First, to employ the RDF data model to publish structured data on the Web. Second, to set explicit RDF links between data items within different data sources. Background information about the Web of Data is found at the wiki pages of the W3C Linking Open Data community effort, in the overview article Linked Data – The Story So Far and in the tutorial on How to publish Linked Data on the Web.

The Silk Link Discovery Framework supports data publishers in accomplishing the second task. Using the declarative Silk – Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language. Silk accesses the data sources that should be interlinked via the SPARQL protocol and can thus be used against local as well as remote SPARQL endpoints.

Of particular interest are the comparison operators:

A comparison operator evaluates two inputs and computes their similarity based on a user-defined metric.
The Silk framework currently supports the following similarity metrics, which return a similarity value between 0 (lowest similarity) and 1 (highest similarity) each:

  • levenshtein([float maxDistance], [float minValue], [float maxValue]): String similarity based on the Levenshtein metric.
  • jaro: String similarity based on the Jaro distance metric.
  • jaroWinkler: String similarity based on the Jaro-Winkler metric.
  • qGrams(int q): String similarity based on q-grams (by default q=2).
  • equality: Returns 1 if the strings are equal, 0 otherwise.
  • inequality: Returns 0 if the strings are equal, 1 otherwise.
  • num(float maxDistance, float minValue, float maxValue): Computes the numeric distance between two numbers and normalizes it using the threshold. Parameters: maxDistance (the similarity score is 0.0 if the distance is bigger than maxDistance); minValue, maxValue (the minimum and maximum values which occur in the data source).
  • date(int maxDays): Computes the similarity between two dates (“YYYY-MM-DD” format). At a difference of maxDays the metric evaluates to 0 and progresses towards 1 as the difference decreases.
  • wgs84(string unit, float threshold, string curveStyle): Computes the geographical distance between two points. Parameters: unit (the unit in which the distance is measured; allowed values: “meter” or “m” (default), “kilometer” or “km”); threshold (results in 0 for all values bigger than the threshold, values below vary with the curveStyle); curveStyle (“linear” gives a linear transition, “logistic” uses the logistic function f(x)=1/(1+e^(x)), giving a softer curve with a slow slope at the start and end of the curve and a steep one in the middle).

Author: Konrad Höffner (MOLE subgroup of Research Group AKSW, University of Leipzig)

(better formatting is available at the original page but I thought the operators important enough to report in full here)
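As a rough illustration of how a distance-based operator turns into a similarity score between 0 and 1 (my sketch, not Silk's implementation), here is a normalized Levenshtein metric with a maxDistance threshold:

```python
# Sketch of a threshold-normalized Levenshtein similarity, in the spirit of the
# operators listed above (not Silk's actual code).
def levenshtein_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b, max_distance=3):
    """1.0 for identical strings, 0.0 once the edit distance exceeds max_distance."""
    d = levenshtein_distance(a, b)
    return 0.0 if d > max_distance else 1.0 - d / max_distance

print(levenshtein_similarity("Köln", "Koeln"))     # ~0.33: two edits, within the threshold
print(levenshtein_similarity("Berlin", "Madrid"))  # 0.0: more than max_distance edits apart
```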

Definitely a step beyond opaque mappings between links. Note, for example, that the Silk Link Specification Language declares why two or more links are mapped together. More could be said, but this is a start in the right direction.

May 10, 2011

Special Issue on Linked Data for Science and Education

Filed under: Linked Data,LOD — Patrick Durusau @ 3:29 pm

Special Issue on Linked Data for Science and Education

The Semantic Web Journal has posted a call for papers on linked data for science and education.

Important dates:

Deadline for submissions: May 31 2011
Reviews due: July 15 2011
Final versions of accepted papers due: August 12 2011

Apologies, I missed this announcement when it came out in early February, 2011.

From the call:

The number of universities, research organizations, publishers and funding agencies contributing to the Linked Data cloud is constantly increasing. The Linked Data paradigm has been identified as a lightweight approach for data dissemination and integration, opening up new opportunities for the organization, integration, archiving and retrieval of research results and educational material. Obviously, this novel approach also raises new challenges regarding the integrity, adoption, use and sustainability of contents. A number of case studies from universities and research communities already demonstrate that Linked Data is not merely a novel way of exposing data on the Web, but that its principles help integrating related data, connecting scientists working on related topics, and improving scientific and educational workflows. The next challenges in creating a true Web of scientific and educational data include dealing with provenance, mapping vocabularies (i.e., ontologies), and organizational issues such as assessing costs and ensuring persistence and performance. In this special issue of the Semantic Web Journal, we want to collect the state of the art in Linked Data for science and education and identify upcoming challenges, focusing on technological aspects as well as social and legal implications.

Well, I like that:

The next challenges in creating a true Web of scientific and educational data include dealing with provenance, mapping vocabularies (i.e., ontologies), and organizational issues such as assessing costs and ensuring persistence and performance.

Link data together and then hope we can sort it out on the other end.

Doesn’t that sound a lot like Google?

Index data together and then hope we can sort it out on the other end.

April 28, 2011

Dataset linkage recommendation on the Web of Data

Filed under: Conferences,Entity Resolution,Linked Data,LOD — Patrick Durusau @ 3:18 pm

Dataset linkage recommendation on the Web of Data by Martijn van der Plaat (Master thesis).

Abstract:

We address the problem of, given a particular dataset, which candidate dataset(s) from the Web of Data have the highest chance of holding co-references, in order to increase the efficiency of coreference resolution. Currently, data publishers manually discover and select the right dataset to perform a co-reference resolution. However, in the near future the size of the Web of Data will be such that data publishers can no longer determine which datasets are candidate to map to. A solution for this problem is finding a method to automatically recommend a list of candidate datasets from the Web of Data and present this to the data publisher as an input for the mapping.

We proposed two solutions to perform the dataset linkage recommendation. The general idea behind our solutions is predicting the chance a particular dataset on the Web of Data holds co-references with respect to the dataset from the data publisher. This prediction is done by generating a profile for each dataset from the Web of Data. A profile is meta-data that represents the structure of a dataset, in terms of used vocabularies, class types, and property types. Subsequently, dataset profiles that correspond with the dataset profile from the data publisher, get a specific weight value. Datasets with the highest weight values have the highest chance of holding co-references.
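To make the profile idea concrete, here is a toy sketch (mine, not the thesis implementation) that represents a dataset profile as a set of vocabulary, class and property IRIs and ranks candidate datasets by overlap with the publisher's profile; all names are made up.

```python
# Toy sketch of the profile-and-weight idea from the abstract: rank candidate
# datasets by how much their profile overlaps with the publisher's profile.
def profile_overlap(profile_a, profile_b):
    """Jaccard overlap between two dataset profiles (sets of vocabulary/class/property IRIs)."""
    if not profile_a or not profile_b:
        return 0.0
    return len(profile_a & profile_b) / len(profile_a | profile_b)

publisher = {"foaf:Person", "foaf:name", "dcterms:creator"}
candidates = {
    "dbpedia": {"foaf:Person", "foaf:name", "dbo:birthPlace"},
    "geonames": {"gn:Feature", "wgs84:lat", "wgs84:long"},
}

ranked = sorted(candidates, key=lambda d: profile_overlap(publisher, candidates[d]), reverse=True)
for dataset in ranked:
    print(dataset, round(profile_overlap(publisher, candidates[dataset]), 2))
```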

A useful exercise but what happens when data sets have inconsistent profiles from different sources?

And for all the drum banging, only a very tiny portion of all available datasets are part of Linked Data.

How do we evaluate the scalability of such a profiling technique?

March 30, 2011

State of the LOD Cloud

Filed under: LOD,RDF,Semantic Web — Patrick Durusau @ 12:36 pm

State of the LOD Cloud

A more complete resource than the one I referenced in The Linking Open Data cloud diagram.

I haven’t seen any movement towards solving any of the fundamental identity issues with the LOD cloud.

On the other hand, topic mappers can make use of these URIs as names and specify other data published with those URIs to form an actual identification.

One that is reliably interchangeable with others.

I think the emphasis is on URIs being dereferenceable.

No one says what happens after a URI is dereferenced but that’s to avoid admitting that a URI is insufficient as an identifier.

December 17, 2010

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

From the call:

Research on the Semantic Web, which is now in its second decade, has had a tremendous success in encouraging people to publish data on the Web in structured, linked, and standardized ways. The success of what has now become the Web of Data can be read from the sheer number of triples available within the Linked-Open Data, Linked Life Data and Open-Government initiatives. However, this growth in data makes many of the established assumptions inappropriate and offers a number of new research challenges.

In stark contrast to early Semantic Web applications that dealt with small, hand-crafted ontologies and data-sets, the new Web of Data comes with a plethora of contradicting world-views and contains incomplete, inconsistent, incorrect, fast-changing and opinionated information. This information not only comes from academic sources and trustworthy institutions, but is often community built, scraped or translated.

In short: the Web of Data is messy, and methods to deal with this messiness are paramount for its future.

Now, we have two choices as the topic map community:

  • congratulate ourselves for seeing this problem long ago, high five each other, etc., or
  • step up and offer topic map solutions that incorporate as much of the existing SW work as possible.

I strongly suggest the second one.

Important dates:

We will aim at an efficient publication cycle in order to guarantee prompt availability of the published results. We will review papers on a rolling basis as they are submitted and explicitly encourage submissions well before the submission deadline. Submit papers online at the journal’s Elsevier Web site.

Submission deadline: 1 February 2011
Author notification: 15 June 2011

Revisions submitted: 1 August 2011
Final decisions: 15 September 2011
Publication: 1 January 2012

November 28, 2010

Names, Identifiers, LOD, and the Semantic Web

Filed under: LOD,Names,RDF,Semantic Web,Subject Identifiers — Patrick Durusau @ 5:28 pm

I have been watching the identifier debate in the LOD community with its revisionists, personal accounts and other takes on what the problem is, if there is a problem and how to solve the problem if there is one.

I have a slightly different question: What happens when we have a name/identifier?

Short of being present when someone points to or touches an object (themselves, or you, if they are the TSA) and says a name or identifier, what happens?

Try this experiment. Take a sheet of paper and write: George W. Bush.

Now write 10 facts about George W. Bush.

Please circle the ones you think must match to identify George W. Bush.

So, even though you knew the name George W. Bush, isn’t it fair to say that the circled facts are what you would use to identify George W. Bush?

Here’s the fun part: Get a colleague or co-worker to do the same experiment. (Substitute Lady Gaga if your friends don’t know enough facts about George W. Bush.)

Now compare several sets of answers for the same person.

Working from the same name, you most likely listed different facts and different ones you would use to identify that subject.

Even though most of you would agree that some or all of the facts listed go with that person.

It sounds like even though we use identifiers/names, those just clue us in on facts, some of which we use to make the identification.

That’s the problem isn’t it?

A name or identifier can make us think of different facts (possibly identifying different subjects) and even if the same subject, we may use different facts to identify the subject.

Assuming we are looking at a set of facts (an RDF graph, whatever), we need to know: which facts identify the subject?

And a subject may have different identifying properties, depending on the context of identification.

Questions:

  1. How to specify essential facts for identification as opposed to the extra ones?
  2. How to answer #1 for an RDF graph?
  3. How do you make others aware of your answer in #2?
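One purely illustrative answer to questions 1 and 2 (not a proposal from the post): pick a subset of properties as identifying, and merge two descriptions only when the identifying properties they share agree. The property names and values below are only examples.

```python
# Illustrative sketch: treat some properties as identifying, merge descriptions
# only when every shared identifying property agrees. Property names are hypothetical.
IDENTIFYING = {"foaf:name", "dbo:birthDate"}

def same_subject(desc_a, desc_b, identifying=IDENTIFYING):
    shared = identifying & desc_a.keys() & desc_b.keys()
    # No shared identifying facts -> we cannot say the descriptions are of the same subject.
    return bool(shared) and all(desc_a[p] == desc_b[p] for p in shared)

a = {"foaf:name": "George W. Bush", "dbo:birthDate": "1946-07-06", "dbo:party": "Republican"}
b = {"foaf:name": "George W. Bush", "dbo:birthDate": "1946-07-06", "dbo:residence": "Texas"}
c = {"foaf:name": "George Bush", "dbo:birthDate": "1924-06-12"}

print(same_subject(a, b))  # True: identifying facts agree, extra facts differ but don't matter
print(same_subject(a, c))  # False: identifying facts differ
```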

Comments/suggestions?

November 19, 2010

“…an absolute game changer”

Filed under: Linked Data,LOD,Marketing,Semantic Web — Patrick Durusau @ 1:27 pm

Aldo Bucchi writes that http://uriburner.com/c/DI463N is:

Single most powerful demo available. Really looking fwd to what’s coming next.

Let’s see how this shifts gears in terms of Linked Data comprehension.
Even in its current state, this is an absolute game changer.

I know this was not easy. My hat goes off to the team for their focus.

Now, just let me send this link out to some non-believers that have
been holding back my evangelization pipeline 😉

I may count as one of the “non-believers.” 😉

Before Aldo throws open the floodgates on his “evangelization pipeline,” let me observe:

The elderly gentlemen appears in: Tropical grassland, Desert, Temperate grassland, Coniferous forest, Flooded grassland, Mountain grassland, Broadleaf forest, Tropical dry forest, Rainforest, Taiga, Tundra, Urban, Tropical coniferous forests, Mountains, Coastal, and Wetlands.

So he must get around a lot.

Only the BBC appears in Estuaries.

Granted, it is a clever presentation of subjects that share a common locale, and it works fairly responsively, but that hardly qualifies as a “…game changer…”

This project is a good experiment on making information more accessible.

Why aren’t the facts enough?

All Identifiers, All The Time – LOD As An Answer?

Filed under: Linked Data,LOD,RDA,Semantic Web,Subject Identity — Patrick Durusau @ 6:25 am

I am still musing over Thomas Neidhart’s comment:

To understand this identifier you would need implicit knowledge about the structure and nature of every possible identifier system in existence, and then you still do not know who has more information about it.

Aside from questions of universal identifier systems failing without exception in the past, which makes one wonder why this system should succeed, there are other questions.

Such as why would any system need to encounter every possible identifier system in existence?

That is, the LOD effort has set up a strawman (apologies for the sexism) that it then proceeds to blow down.

If a subject has multiple identifiers in a set and my system recognizes only one out of three, what harm has come of the subject having the other two identifiers?

There is no processing overhead since, by admission, the system does not recognize the other identifiers, so it doesn’t process them.

The advantage being that some other system may recognize the subject on the basis of the other identifiers.

This post is a good example of that practice.

I had a category “Linked Data,” but I added a category this morning, “LOD,” just in case people search for it that way.

Why shouldn’t our computers adapt to how we use identifiers (multiple ones for the same subjects) rather than our attempting (and failing) to adapt to universal identifiers to make it easy for our computers?
