Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

March 3, 2012

Bipartite Graphs as Intermediate Model for RDF

Filed under: Hypergraphs,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Bipartite Graphs as Intermediate Model for RDF by Jonathan Hayes and Claudio Gutierrez.

Abstract:

RDF Graphs are sets of assertions in the form of subject-predicate-object triples of information resources. Although for simple examples they can be understood intuitively as directed labeled graphs, this representation does not scale well for more complex cases, particularly regarding the central notion of connectivity of resources.

We argue in this paper that there is need for an intermediate representation of RDF to enable the application of well-established methods from Graph Theory. We introduce the concept of Bipartite Statement-Value Graph and show its advantages as intermediate model between the abstract triple syntax and data structures used by applications. In the light of this model we explore issues like transformation costs, data/schema structure, the notion of connectivity, and database mappings.

A quite different take on the representation of RDF than in Is That A Graph In Your Cray? Here we encounter hypergraphs for modeling RDF.
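If the bipartite model sounds abstract, a small example may help. Below is my own toy rendering of the statement-value idea in Python (not the authors' formalism): every triple becomes a statement node tied to its subject, predicate and object value nodes, and connectivity questions become walks that alternate between the two partitions.

```python
# A minimal sketch of the bipartite statement-value idea, using plain dictionaries.
# Nothing here depends on the authors' actual data structures.
from collections import defaultdict

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

statement_nodes = {}            # statement id -> (s, p, o)
value_edges = defaultdict(set)  # value node -> set of (statement id, role)

for i, (s, p, o) in enumerate(triples):
    stmt = f"stmt{i}"
    statement_nodes[stmt] = (s, p, o)
    for role, term in (("subject", s), ("predicate", p), ("object", o)):
        value_edges[term].add((stmt, role))

# "Which statements mention ex:bob, and in what role?"
print(sorted(value_edges["ex:bob"]))
# [('stmt0', 'object'), ('stmt1', 'subject')]
```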

Suggestions on how to rank graph representations of RDF?

Or perhaps better, suggestions on how to rank graph representations for use cases?

Putting the question of what (connections/properties) we want to model before the question of how (RDF, etc.) we intend to model it.

Isn’t that the right order?

Comments?

Populating the Semantic Web…

Filed under: Data Mining,Entity Extraction,Entity Resolution,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Populating the Semantic Web – Combining Text and Relational Databases as RDF Graphs by Kate Byrne.

I ran across this while looking for RDF graph material today. Delighted to find someone interested in the problem of what we do with existing data, even if new data is in some Semantic Web format.

Abstract:

The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.

Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.

The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.

Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates.

These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.

This will take some time to read but it looks quite enjoyable.
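One claim from the abstract is easy to try for yourself: using path length between nodes as a measure of relatedness, which has no equivalent in a relational schema. A toy sketch with networkx (the entities are invented, nothing below comes from the thesis):

```python
# Toy illustration of path length as a relatedness measure, not Byrne's system.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Castle A", "Architect X"),
    ("Architect X", "Castle B"),
    ("Castle B", "Town Y"),
])

# Shorter paths suggest more closely related records.
print(nx.shortest_path_length(g, "Castle A", "Castle B"))  # 2
print(nx.shortest_path_length(g, "Castle A", "Town Y"))    # 3
```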

MapReduceXMT

Filed under: MapReduceXMT,OWL,RDF — Patrick Durusau @ 7:28 pm

MapReduceXMT from Sandia National Laboratories.

From the webpage:

Welcome to MapReduceXMT

MapReduceXMT is a library that ports the MapReduce paradigm to the Cray XMT.

MapReduceXMT is copyrighted and released under a Berkeley open source license. However, the code is still very much in development and there has not been a formal release of the software.

SPEED-MT Semantic Processing Executed Efficiently and Dynamically

This trac site is currently being used to house SPEED-MT, which contains a set of algorithms and data structures for processing semantic web data on the Cray XMT.

SPEED-MT Modules

  • Dictionary Encoding
  • Decoding
  • RDFS/OWL Closure
  • RDF Stats
  • RDF Dedup

OK, so this one is tied a little more closely to the Cray XMT. 😉

But the modules are ones likely to be of interest for processing RDF triples/quads.

This was cited in the "High-performance Computing Applied to Semantic Databases" article that I covered in Is That A Graph In Your Cray?

Is That A Graph In Your Cray?

Filed under: Cray,Graphs,Neo4j,RDF,Semantic Web — Patrick Durusau @ 7:27 pm

If you want more information about graph processing in Cray’s uRIKA (I did), try: High-performance Computing Applied to Semantic Databases by Eric L. Goodman, Edward Jimenez, David Mizell, Sinan al-Saffar, Bob Adolf, and David Haglin.

Abstract:

To-date, the application of high-performance computing resources to Semantic Web data has largely focused on commodity hardware and distributed memory platforms. In this paper we make the case that more specialized hardware can offer superior scaling and close to an order of magnitude improvement in performance. In particular we examine the Cray XMT. Its key characteristics, a large, global shared memory, and processors with a memory-latency tolerant design, offer an environment conducive to programming for the Semantic Web and have engendered results that far surpass current state of the art. We examine three fundamental pieces requisite for a fully functioning semantic database: dictionary encoding, RDFS inference, and query processing. We show scaling up to 512 processors (the largest configuration we had available), and the ability to process 20 billion triples completely in memory.

Unusual to see someone apologize for only having "…512 processors (the largest configuration we had available)…," but that isn't why I am citing the paper. 😉

The "dictionary encoding" (think indexing) techniques may prove instructive, even if you don't have time on a Cray XMT. The techniques presented achieve a compression of the raw data by a factor of between 3.2 and 4.4.
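If "dictionary encoding" is unfamiliar, the basic move is to replace long IRIs and literals with small integer ids so that every triple becomes a fixed-width tuple; the compression comes from not storing the long strings more than once. A minimal sketch of the encoding step (mine, not the Cray XMT implementation):

```python
# Dictionary encoding in miniature: terms -> integer ids, triples -> id tuples.
def encode(triples):
    dictionary, encoded = {}, []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in dictionary:
                dictionary[term] = len(dictionary)   # next unused id
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded

dictionary, encoded = encode([
    ("<http://example.org/alice>", "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/bob>"),
    ("<http://example.org/bob>",   "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/alice>"),
])
print(encoded)   # [(0, 1, 2), (2, 1, 0)]
```

Decoding (also one of the SPEED-MT modules mentioned above) is just the reverse lookup over the same dictionary.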

Take special note of the statement: “To simplify the discussion, we consider only semantic web data represented in N-Triples.” Actually the system presented processes only subject, edge, object triples. Unlike Neo4j, for instance, it isn’t a generalized graph engine.

Specialized hardware/software is great but let’s be clear about that upfront. You may need more than RDF graphs can offer. Like edges with properties.

Other specializations: the "closure" process makes several simplifications to enable a single pass through the RDFS rule set, and querying doesn't allow a variable in the predicate position.
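If the single-pass point is unfamiliar: a naive RDFS closure has to keep applying rules until no new triples appear, so getting away with one pass requires simplifying or ordering the rules. A toy illustration of the difference, using a single rule (type propagation along rdfs:subClassOf); this is my sketch of the general idea, not the paper's algorithm:

```python
# One RDFS rule (rdfs9): if (x, rdf:type, C) and (C, rdfs:subClassOf, D)
# then (x, rdf:type, D). A single pass applies it once; a full closure loops
# until a fixpoint is reached.
RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def closure(triples, single_pass=False):
    triples = set(triples)
    while True:
        new = {(x, RDF_TYPE, d)
               for (x, p1, c) in triples if p1 == RDF_TYPE
               for (c2, p2, d) in triples if p2 == SUBCLASS and c2 == c}
        added = new - triples
        triples |= added
        if single_pass or not added:
            return triples

data = [("ex:fido", "rdf:type", "ex:Dog"),
        ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
        ("ex:Mammal", "rdfs:subClassOf", "ex:Animal")]

print(len(closure(data, single_pass=True)))  # 4 triples: fido is a Mammal
print(len(closure(data)))                    # 5 triples: fixpoint also makes fido an Animal
```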

Granted, this results in a hardware/software combination that can claim "interactivity" on large data sets, but what is the cost of making that a requirement?

Take the best-known "connect the dots" problem of this century, 9/11. Analysts did not need "interactivity" with large data sets measured in nanoseconds. Batch processing that lasted for a week or more would have been more than sufficient. Most of the information that was known was "known" by various parties for months.

More than that, the amount of relevant information was quite small when compared to the "Semantic Web." There were known suspects (as there are now), with known associates, with known travel patterns, so eliminating all the business/frequent flyers from travel data is a one-time filter, plus any > 40 females traveling on US passports (grandmothers). Similar criteria can reduce information clutter, allowing analysts to focus on important data, as opposed to paging through "hits" in a simulation of useful activity.

I would put batch processing of graphs of relevant information against interactive churning of big data in a restricted graph model any day. How about you?

March 2, 2012

Breaking into the NoSQL Conversation

Filed under: NoSQL,RDF,Semantic Web — Patrick Durusau @ 8:05 pm

Breaking into the NoSQL Conversation by Rob Gonzalez.

Semantic Web Community: I’m disappointed in us! Or at least in our group marketing prowess. We have been failing to capitalize on two major trends that everyone has been talking about and that are directly addressable by Semantic Web technologies! For shame.

I’m talking of course about Big Data and NoSQL. Given that I’ve already given my take on how Semantic Web technology can help with the Big Data problem on SemanticWeb.com, this time around I’ll tackle NoSQL and the Semantic Web.

After all, we gave up SQL more than a decade ago. We should be part of the discussion. Heck, even the XQuery guys got in on the action early!

(much content omitted, read at your leisure)

AllegroGraph, Virtuoso, and Systap can all scale, and can all shard like Mongo. We have more mature, feature rich, and robust APIs via Sesame and others to interact with the data in these stores. So why aren’t we in the conversation? Is there something really obvious that I’m missing?

Let’s make it happen. For more than a decade our community has had a vision for how to build a better web. In the past, traditional tools and inertia have kept developers from trying new databases. Today, there are no rules. It’s high time we stepped it up. On the web we can compete with MongoDB directly on those use cases. In the enterprise we can combine the best of SQL and NoSQL for a new class of flexible, robust data management tools. The conversation should not continue to move so quickly without our voice.

I hate to disappoint but the reason the conversation is moving so quickly is the absence of the Semantic Web voice.

Consider my post earlier today about the new hardware/software release by Cray, A Computer More Powerful Than Watson. The release refers to RDF as a “graph format.”

With good reason. The uRIKA system doesn’t use RDF for reasoning at all. It materializes all the implied nodes and searches the materialized graph. Impressive numbers but reasoning it isn’t.

Inertia did not stop developers from trying new databases. New databases that met no viable (commercially that is) use cases went unused. What’s so hard to understand about that?

February 21, 2012

data modelling and FRBR WEMI ontology

Filed under: FRBR,RDF,Sets — Patrick Durusau @ 8:00 pm

data modelling and FRBR WEMI ontology

Jonathan Rochkind writes to defend the FRBR WEMI ontology:

Karen Coyle writes on the RDA listserv:

FRBR claims to be based on a "relational" model, as in "relational database." That is not tomorrow's data model; it is yesterday's, although it is a step toward tomorrow's model. The difficulty is that FRBR was conceived of in the early 1990s, and completed in the late 1990s. That makes it about 15 years old.

I think it would have been just as much a mistake to tie the FRBR model to an RDF model as it would have/was to tie it to a relational database model. Whatever we come up with is going to last us more than 15 years, and things will change again. Now, I’ll admit that I’m heretically still suspicious that an RDF data model will in fact be ‘the future’. But even if it is, there will be another future (or simultaneous futures plural).

And concludes:

I tend to think they should have just gone with ‘set theory’ oriented language, because it is, I think, the most clear, while still being abstract enough to make it harder to think the WEMI ontology is tied to some particular technology like relational databases OR linked data. I think WEMI gets it right regardless of whether you speak in the language of ‘relational’, ‘set theory’, ‘object orientation’ or ‘linked data’/RDF.

Leaving my qualms about RDF to one side, I write to point out that choosing "set theory" is a choice of a particular technology or, if you like, a tradition.

If that sounds odd, consider how many times you have used set theory in the last week, month, or year. Unless you are a logician or introductory mathematics professor, the odds are that the number is zero (0) (or the empty set, {}, for any logicians reading this post).

Choosing “set theory” is to choose a methodology that very few people use in practice. The vast majority of people make choices, evaluate outcomes, live complex lives innocent of the use of set theory.

I don’t object to FRBR or other efforts choosing to use “set theory” but recognize it is a minority practice.

One that elevates a minority over the majority of users.

February 7, 2012

Ignorance & Semantic Uniformity

Filed under: GoodRelations,Ontology,RDF — Patrick Durusau @ 4:34 pm

I saw Volkswagen Vehicles Ontology by Martin Hepp, in a tweet by Bob DuCharme today.

Being a former owner of a 1972 Super Beetle, I checked under

vvo:AudioAndNavigation

Only to find that a cassette player wasn't one of the options:

The class of audio and navigation choices or components (CD/DVD/SatNav, a “MonoSelectGroup” in automotive terminology), VW ID: 1

I searched the ontology for “Beetle” and came up empty.

Is ignorance the path to semantic uniformity?

February 4, 2012

ADMS Public Review is launched

Filed under: ADMS,Ontology,RDF — Patrick Durusau @ 3:40 pm

ADMS Public Review is launched

Public Review ends: 6 February 2012

From the post:

The ISA programme of the European Commission launched the public review of the Asset Description Metadata Schema (ADMS) on 6 January 2012; this will end on 6 February 2012 (inclusive).

From mid 2012, the Joinup platform, of the ISA programme, will make available a large number of semantic interoperability assets, described using ADMS, through a federation of asset repositories of Member States, standardisation bodies and other relevant stakeholders.

Apologies for the late notice but this item just came to my attention.

This is version 0.8, so unless the EC uses Hadoop numbering practices (jumping from 0.22 to 1.0), I suspect there will be additional opportunities to comment.

ADMS 0.8 has the following files:

At least as of today, 4 February 2012, the following two files don’t require you to answer if you are willing to participate in a post-download survey. I know every marketing department thinks their in-house and amateurish surveys are meaningful. Not. Ask a professional survey group if you really want to do surveys. Expensive but at least they will be meaningful.

These five (5) files require you to register and accept the post-download survey, or to answer "No, I prefer to remain anonymous – start the download immediately" five (5) times.

The ADMS_Specification-v0.8.zip file contains ADMS_Specification-v0.8.pdf (which is listed above).

The specification document is thirty-five (35) pages long so it won’t take you long to read.

I was puzzled by the documentation note (dcterms:abstract) in the adms08.rdf file that reads:

ADMS is intended as a model that facilitates federation and co-operation. It is not the primary intention that repository owners redesign or convert their current systems and data to conform to ADMS, but rather that ADMS can act as a common layer among repositories that want to exchange data.

But the examples found in ADMS_Examples-v0.8.zip are dated variously (2011 – ADMS_Examples_Digitaliser_v0.03.pdf; 2010 – ADMS_Examples_ADMS_v0.03.pdf, ADMS_Examples_DCMES_v0.03.pdf; 2009 – ADMS_Examples_SKOS_v0.04.pdf), with version numbers (v0.03 and v0.04) that leave doubt about the examples being current with the specification draft.

Moreover, the examples are contrary to the goal of ADMS in that they represent presentation of data in ADMS rather than using ADMS as a target vocabulary. In other words, if you are a target vocabulary, give target vocabulary examples.

Do you have a feeling of deja vu reading these documents? Been here, done that? Which projects would you name off the top of your head that cover some, all or more than the ground covered here? (Extra points if you look up citations/URLs.)


Shameless self-promotion follows if you want to stop reading here.

It doesn't look like my editing schedule is full for this year. Ghost or public editing of documentation or standards is available. ODF 1.2 is an example of what is possible with a dedicated technical team like Sun had at Hamburg backing me as an editor. It is undergoing revision but no standard or document is ever perfect. Anyone who says differently is misinformed or lying.

January 25, 2012

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine

Filed under: Linked Data,RDF,Search Engines,Semantic Web — Patrick Durusau @ 3:30 pm

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine by Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres and Stefan Decker.

Abstract:

In this paper, we discuss the architecture and implementation of the Semantic Web Search Engine (SWSE). Following traditional search engine architecture, SWSE consists of crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information; unlike traditional search engines, SWSE operates over RDF Web data – loosely also known as Linked Data – which implies unique challenges for the system design, architecture, algorithms, implementation and user interface. In particular, many challenges exist in adopting Semantic Web technologies for Web data: the unique challenges of the Web – in terms of scale, unreliability, inconsistency and noise – are largely overlooked by the current Semantic Web standards. Herein, we describe the current SWSE system, initially detailing the architecture and later elaborating upon the function, design, implementation and performance of each individual component. In so doing, we also give an insight into how current Semantic Web standards can be tailored, in a best-effort manner, for use on Web data. Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions. Later, we also provide candid discussion relating to the difficulties currently faced in bringing such a search engine into the mainstream, and lessons learnt from roughly six years working on the Semantic Web Search Engine project.

This is the paper that Ivan Herman mentions at Nice reading on Semantic Search.

It covers a lot of ground in fifty-five (55) pages but it doesn’t take long to hit an issue I wanted to ask you about.

At page 2, Google is described as follows:

In the general case, Google is not suitable for complex information gathering tasks requiring aggregation from multiple indexed documents: for such tasks, users must manually aggregate tidbits of pertinent information from various recommended heterogeneous sites, each such site presenting information in its own formatting and using its own navigation system. In effect, Google's limitations are predicated on the lack of structure in HTML documents, whose machine interpretability is limited to the use of generic markup-tags mainly concerned with document rendering and linking. Although Google arguably makes the best of the limited structure available in such documents, most of the real content is contained in prose text which is inherently difficult for machines to interpret. Addressing this inherent problem with HTML Web data, the Semantic Web movement provides a stack of technologies for publishing machine-readable data on the Web, the core of the stack being the Resource Description Framework (RDF).

A couple of observations:

Although Google needs no defense from me, I would argue that Google never set itself the task of aggregating information from indexed documents. Historically speaking, IR has always been concerned with returning relevant documents and not returning irrelevant documents.

Second, the lack of structure in HTML documents (although the article mixes in sites with different formatting) is no deterrent to a human reader aggregating “tidbits of pertinent information….” I rather doubt that writing all the documents in valid Springer LaTeX would make that much difference on the “tidbits of pertinent information” score.

This is my first pass through the article and I suspect it will take three or more to become comfortable with it.

Do you agree/disagree that the task of IR is to retrieve documents, not “tidbits of pertinent information?”

Do you agree/disagree that HTML structure (or lack thereof) is that much of an issue for the interpretation of documents?

Thanks!

Nice reading on Semantic Search

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 3:25 pm

Nice reading on Semantic Search by Ivan Herman.

From the post:

I had a great time reading a paper on Semantic Search[1]. Although the paper is on the details of a specific Semantic Web search engine (DERI’s SWSE), I was reading it as somebody not really familiar with all the intricate details of such a search engine setup and operation (i.e., I would not dare to give an opinion on whether the choice taken by this group is better or worse than the ones taken by the developers of other engines) and wanting to gain a good image of what is happening in general. And, for that purpose, this paper was really interesting and instructive. It is long (cca. 50 pages), i.e., I did not even try to understand everything at my first reading, but it did give a great overall impression of what is going on.

Interested to hear your take on Ivan’s comments on owl:sameAs.

The semantics of words, terms, ontology classes are not stable over time and/or users. If you doubt that statement, leaf through the Oxford English Dictionary for ten (10) minutes.

Moreover, the only semantics we “see” in words, terms or ontology classes are those we assign them. We can discuss the semantics of Hebrew words in the Dead Sea Scrolls but those are our semantics, not those of the original users of those words. May be close to what they meant, may not. Can’t say for sure because we can’t ask and would lack the context to understand the answer if we could.

Adding more terms to use as supplements to owl:sameAs just increases the chances for variation. And error if anyone is going to enforce their vision of broadMatch on usages of that term by others.

January 23, 2012

Semantic Web – Sweet Spot(s) and ‘Gold Standards’

Filed under: OWL,RDF,UMBEL,Wikipedia,WordNet — Patrick Durusau @ 7:43 pm

Mike Bergman posted a two-part series on how to make the Semantic Web work:

Seeking a Semantic Web Sweet Spot

In Search of ‘Gold Standards’ for the Semantic Web

Both are worth your time to read but the second sets the bar for “Gold Standards” for the Semantic Web as:

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

There is one criterion that Mike leaves out: the choice of a majority of users.

Use by a majority of users is a sweet spot that brooks no argument.

January 19, 2012

RDF silos

Filed under: Linked Data,RDF — Patrick Durusau @ 7:35 pm

Bibliographic Framework: RDF and Linked Data

Karen Coyle writes:

With the newly developed enthusiasm for RDF as the basis for library bibliographic data we are seeing a number of efforts to transform library data into this modern, web-friendly format. This is a positive development in many ways, but we need to be careful to make this transition cleanly without bringing along baggage from our past.

Recent efforts have focused on translating library record formats into RDF with the result that we now have:
    ISBD in RDF
    FRBR in RDF
    RDA in RDF

and will soon have
    MODS in RDF

In addition there are various applications that convert MARC21 to RDF, although none is “official.” That is, none has been endorsed by an appropriate standards body.

Each of these efforts takes a single library standard and, using RDF as its underlying technology, creates a full metadata schema that defines each element of the standard in RDF. The result is that we now have a series of RDF silos, each defining data elements as if they belong uniquely to that standard. We have, for example, at least four different declarations of “place of publication”: in ISBD, RDA, FRBR and MODS, each with its own URI. There are some differences between them (e.g. RDA separates place of publication, manufacture, production while ISBD does not) but clearly they should descend from a common ancestor:
(emphasis added)

Karen makes a very convincing argument about RDF silos and libraries.
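To see how shallow the fix could be in principle, here is a minimal sketch of mapping silo-specific predicates onto a common ancestor before data sets are merged. The predicate URIs below are placeholders, not the real ISBD/RDA/FRBR/MODS identifiers:

```python
# Placeholder predicate URIs, for illustration only.
COMMON_ANCESTOR = "ex:placeOfPublication"

SILO_PREDICATES = {
    "isbd:hasPlaceOfPublication": COMMON_ANCESTOR,
    "rda:placeOfPublication":     COMMON_ANCESTOR,
    "mods:placeTerm":             COMMON_ANCESTOR,
}

def normalize(triples):
    """Rewrite silo-specific predicates to the common one; keep everything else."""
    return [(s, SILO_PREDICATES.get(p, p), o) for (s, p, o) in triples]

print(normalize([("ex:book1", "rda:placeOfPublication", '"Leipzig"')]))
# [('ex:book1', 'ex:placeOfPublication', '"Leipzig"')]
```

The hard part, of course, is agreeing on the common ancestor and documenting how each silo's element maps to it, which is the part the existing RDF conversions leave undone.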

I am less certain about her prescription that libraries concentrate on creating data and build records for that data separately.

In part because there aren’t any systems where data exists separate from either an implied or explicit structure to access it. And those structures are just as much “data” as the “data” they enclose. We may not often think of it that way but shortcomings on our part don’t alter our data and the “data” that encloses it.

January 17, 2012

The ClioPatria Semantic Web server

Filed under: Prolog,RDF,Semantic Web — Patrick Durusau @ 8:18 pm

The ClioPatria Semantic Web server

I ran across this whitepaper about the ClioPatria Semantic Web server that reads in part:

What is ClioPatria?

ClioPatria is a (SWI-)Prolog hosted HTTP application-server with libraries for Semantic Web reasoning and a set of JavaScript libraries for presenting results in a browser. Another way to describe ClioPatria is as “Tomcat+Sesame (or Jena) with additional reasoning libraries in Prolog, completed by JavaScript presentation components”.

Why is ClioPatria based on Prolog?

Prolog is a logic-based language using a simple depth-first resolution strategy (SLD resolution). This gives two readings to the same piece of code: the declarative reading and the procedural reading. The declarative reading facilitates understanding of the code and allows for reasoning about it. The procedural reading allows for specifying algorithms and sequential aspects of the code, something which we often need to describe interaction. In addition, Prolog is reflexive: it can reason about Prolog programs and construct them at runtime. Finally, Prolog is, like the RDF triple-model, relational. This match of paradigms avoids the complications involved with using Object Oriented languages for handling RDF (see below). We illustrate the fit between RDF and Prolog by translating an example query from the official SPARQL document:…

Just in case you are interested in RDF or Prolog or both.

January 16, 2012

Introducing Meronymy SPARQL Database Server

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 2:33 pm

Introducing Meronymy SPARQL Database Server

Inge Henriksen writes:

I am pleased to announce today that the Meronymy SPARQL Database Server is ready for release later in 2012. Meronymy SPARQL Database Server is a high performance RDF Enterprise Database Management System (DBMS).

Our goal has been to make a really fast, ACID, OS portable, user friendly, secure, SPARQL-driven RDF database server usable with most programming languages.

Let's not start any language wars about Meronymy being written in C++/assembly 😉 and concentrate on its performance in actual use.

Suggested RDF data sets to use to test that performance? (Knowing Inge, I trust it is fast, but the question is how fast and under what circumstances.)

Or other RDF engines to test along side of it?

PS: If you don't know SPARQL, check out Learning SPARQL by Bob DuCharme.

January 13, 2012

Meronymy SPARQL Database Server

Filed under: RDF,SPARQL — Patrick Durusau @ 8:15 pm

Meronymy SPARQL Database Server

Inge Henriksen writes:

We are pleased to announce that the Meronymy SPARQL Database Server is ready for release later in 2012. Those interested in our RDF database server software should consider registering today; those that do get exclusive early access to beta software in the upcoming closed beta testing period, insider news on the development progress, get to submit feature requests, and otherwise directly influence the finished product.

From the FAQ we learn some details:

A: All components in the database server and its drivers have been programmed from scratch so that we could optimize them in terms of their performance.
We developed the database server in C++ since this programming language has the most potential for optimization; there is also some inline assembly at key locations in the programming code.
Some more components that make our database management system very fast:

  • In-process query optimizer; determines the most efficient way to execute a query.
  • In-process memory manager; for much faster memory allocation and deallocation than the operating system can provide.
  • In-process multithreaded HTTP server; for a much faster SPARQL Protocol endpoint than through a standard out-of-process web server.
  • In-process multithreaded TCP/IP listener with thread pooling; for efficient thread management.
  • In-process directly coded lexical analyzer; for efficient query parsing.
  • Snapshot isolation; for fast transaction processing.
  • B+ trees; for fast indexing.
  • In-process stream-oriented XML parser; for fast RDF/XML parsing.
  • An RDF data model; for no data model abstraction layers, which results in faster processing of data.

I’m signing up for the beta. How about you?

January 6, 2012

I-CHALLENGE 2012 : Linked Data Cup

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 11:39 am

I-CHALLENGE 2012 : Linked Data Cup

Dates:

When Sep 5, 2012 – Sep 7, 2012
Where Graz, Austria
Submission Deadline Apr 2, 2012
Notification Due May 7, 2012
Final Version Due Jun 4, 2012

From the call for submissions:

The yearly organised Linked Data Cup (formerly Triplification Challenge) awards prizes to the most promising innovation involving linked data. Four different technological topics are addressed: triplification, interlinking, cleansing, and application mash-ups. The Linked Data Cup invites scientists and practitioners to submit novel and innovative (5 star) linked data sets and applications built on linked data technology.

Although more and more data is triplified and published as RDF and linked data, the question arises how to evaluate the usefulness of such approaches. The Linked Data Cup therefore requires all submissions to include a concrete use case and problem statement alongside a solution (triplified data set, interlinking/cleansing approach, linked data application) that showcases the usefulness of linked data. Submissions that can provide measurable benefits of employing linked data over traditional methods are preferred.
Note that the call is not limited to any domain or target group. We accept submissions ranging from value-added business intelligence use cases to scientific networks to the longest tail of information domains. The only strict requirement is that the employment of linked data is very well motivated and also justified (i.e. we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance). (emphasis added)

I don’t know what the submissions are going to look like but the conference organizers should get high marks for academic honesty. I don’t think I have ever seen anyone say:

we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance

We have all seen challenges with qualifying requirements but I don’t recall any that would privilege lesser work because of a greater dependence on a requirement. Or at least that would publicly claim that was the contest policy. Have there been complaints from technically or scientifically brilliant approaches about judging in the past?

I will have to watch the submissions and results to see if technically or scientifically brilliant approaches get passed over in favor of lesser ones. If so, that will be a signal to first-rate competitors to seek recognition elsewhere.

December 22, 2011

Drs. Wood & Seuss Explain RDF in Two Minutes

Filed under: RDF,Semantic Web,Semantics — Patrick Durusau @ 7:38 pm

Drs. Wood & Seuss Explain RDF in Two Minutes by Eric Franzon.

From the post:

“How would you explain RDF to my grandmother? I still don’t get it…” a student recently asked of David Wood, CTO of 3Roundstones. Wood was speaking to a class called “Linked Data Ventures” and was made up of students from the MIT Computer Science Department and the Sloan School of Business. He responded by creating a slide deck and subsequent video explaining the Resource Description Framework using the classic Dr. Seuss style of rhyming couplets and the characters Thing 1 and Thing 2.

I hope this student’s grandmother found this as enjoyable as I did. (Video after the jump).

This is a great explanation of RDF. You won't be authoring RDF after the video but you will have the basics.

Take this as a goad to come up with something similar for topic maps and other semantic technologies.

December 17, 2011

Open Source News: “Hating Microsoft/IBM/Oracle/etc is not a strategy”

Filed under: Open Source,RDF,Topic Maps — Patrick Durusau @ 7:55 pm

Publishing News: “Hating Amazon is not a strategy”

Sorry, the parallels to the open source community and the virgins, hermits and saints who regularly abuse the vendors that support most of the successful open source projects, either directly or indirectly, were just too obvious to pass up. Apologies to Don Linn for stealing his line.

By the same token, hating RDF isn’t a strategy either. 😉

Which is why I have come to think that RDF instances should be consumed and processed as seems most appropriate to the situation. RDF is just another data format and what we make of it is an answer to be documented as part of our processing of that data. Just as with any other data source. Most of which are not going to be RDF.

SDShare

Filed under: RDF,SDShare,Semantic Web,Semantics — Patrick Durusau @ 7:54 pm

SDShare (PDF file)

According to the blog entry dated 16 December 2011, with a pointer to this presentation, this is a “recent” presentation. But the presentation has a copyright claim dated 2010. So it is either nearly a year old or it is one of those timeless artifacts on the web.

The ones that have no reliable indication of a date of composition or publishing. Appropriate for the ephemera that make up the eternal “now” of the WWW. Less appropriate for important technology documents, particularly ones that aspire to be ISO standards in the near future.

The slide deck is a good overview of the goals of SDShare, if a bit short in terms of the actual details. I would suggest using the slide deck to interest others in learning more and then passing on to them the original SDShare document.

I would quibble with the claim at slide 34 that RDF data makes “…merging simple.” So far as I know, RDF never specifies what happens when you have multiple distinct and perhaps inconsistent values for the same property. Perhaps I have overlooked that in the plethora of RDF standards, revisions and retreats.
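To make the quibble concrete: setting blank nodes aside, an RDF merge is essentially a set union of triples, so two sources that disagree about the same property simply coexist after the "merge" and the application is left to sort them out. A minimal sketch:

```python
# "Merging" two RDF sources that disagree about ex:weight.
source_a = {("ex:widget", "ex:weight", '"10"'),
            ("ex:widget", "ex:color",  '"red"')}
source_b = {("ex:widget", "ex:weight", '"12"')}

merged = source_a | source_b   # set union is all the merge gives you

conflicts = {}
for s, p, o in merged:
    conflicts.setdefault((s, p), set()).add(o)

for (s, p), values in conflicts.items():
    if len(values) > 1:
        print(f"unresolved: {s} {p} -> {sorted(values)}")
# unresolved: ex:widget ex:weight -> ['"10"', '"12"']
```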

December 7, 2011

USEWOD 2012 Data Challenge

Filed under: Contest,Linked Data,RDF,Semantic Web,Semantics — Patrick Durusau @ 8:08 pm

USEWOD 2012 Data Challenge

From the website:

The USEWOD 2012 Data Challenge invites research and applications built on the basis of USEWOD 2012 Dataset.

Accepted submissions will be presented at USEWOD2012, where a winner will be chosen. Examples of analyses and research that could be done with the dataset are the following (but not limited to those):

  • correlations between linked data requests and real-world events
  • types of structured queries
  • linked data access vs. conventional access
  • analysis of user agents visiting the sites
  • geographical analysis of requests
  • detection and visualisation of trends
  • correlations between site traffic and available datasets
  • etc. – let your imagination run wild!

USEWOD 2012 Dataset

The USEWOD dataset consists of server logs from two major web servers publishing datasets on the Web
of linked data. In particular, the dataset contains logs from:

  • DBPedia: slices of log data
    spanning several months from
    the linked data twin of Wikipedia, one of the focal points of the Web of data.
    The logs were kindly made available to us for the challenge
    by OpenLink Software!
    Further details about this part of the dataset to follow.
  • SWDF:
    Semantic Web Dog Food is a
    constantly growing dataset of publications, people and organisations in the Web and Semantic Web area,
    covering several of the major conferences and workshops, including WWW, ISWC and ESWC. The logs
    contain two years of requests to the server from about 12/2008 until 12/2010.
  • Linked Open Geo Data A dataset about geographical data.
  • Bio2RDF Linked Data for life sciences.

Data sets are still under construction. Organizers advise that data sets should be available next week.

Your results should be reported as short papers and are due by 15 February 2012.

December 3, 2011

How to Execute the Research Paper

Filed under: Annotation,Biomedical,Dynamic Updating,Linked Data,RDF — Patrick Durusau @ 8:21 pm

How to Execute the Research Paper by Anita de Waard.

I had to create the category, “dynamic updating,” to at least partially capture what Anita describes in this presentation. I would have loved to be present to see it in person!

The gist of the presentation is that we need to create mechanisms to support research papers being dynamically linked to the literature and other resources. One example that Anita uses is linking a patient’s medical records to reports in literature with professional tools for the diagnostician.

It isn't clear how Linked Data (no matter how generously described by Jeni Tennison) could be the second technology for making research papers linked to other data. In part because, as Jeni points out, URIs are simply more names for some subject. We don't know if that name is for the resource or something the resource represents. Makes reliable linking rather difficult.

BTW, the web lost its ability to grow in a “gradual and sustainable way” when RDF/Linked Data introduced the notion that URIs cannot be allowed to fail. If you try to reason based on something that fails, the reasoner falls on its side. Not nearly as robust as allowing semantic 404’s.

Anita's third step, an integrated workflow, is certainly the goal to which we should be striving. I am less convinced that the mechanisms, such as generating linked data stores in addition to the documents we already have, are the way forward. For documents, for instance, why do we need to repeat data they already possess? Why can't documents represent their contents themselves? Oh, because that isn't how Linked Data/RDF stores work.

Still, I would highly recommend this slide deck and that you catch any presentation by Anita that you can.

November 27, 2011

Top Three Technologies to Tame the Big Data Beast

Filed under: Description Logic,RDF,Semantic Web — Patrick Durusau @ 8:51 pm

Top Three Technologies to Tame the Big Data Beast by Steve Hamby.

I would re-order some of Steve’s remarks. For example, on the Semantic Web, why not put those paragraphs first:

The first technology needed to tame Big Data — derived from the “memex” concept — is semantic technology, which loosely implements the concept of associative indexing. Dr. Bush is generally considered the godfather of hypertext based on the associative indexing concept, per his 1945 article. The Semantic Web, paraphrased from a definition by the World Wide Web Consortium (W3C), extends hyperlinked Web pages by adding machine-readable metadata about the Web page, including relationships across Web pages, thus allowing machine agents to process the hyperlinks automatically. The W3C provides a series of standards to implement the Semantic Web, such as Web Ontology Language (OWL), Resource Description Framework (RDF), Rule Interchange Format (RIF), and several others.

The May 2001 Scientific American article “The Semantic Web” by Tim Berners-Lee, Jim Hendler, and Ora Lassila described the Semantic Web as agents that query ontologies representing human knowledge to find information requested by a human. OWL ontology is based on Description Logics, which are both expressive and decidable, and provide a foundation for developing precise models about various domains of knowledge. These ontologies provide the “memory index” that enables searches across vast amounts of data to return relevant, actionable information, while addressing key data trust challenges as well. The ability to deliver semantics to a mobile device, such as what the recent release of the iPhone 4S does with Siri, is an excellent step in taming the Big Data beast, since users can get the data they need when and where they need it. Big Data continues to grow, but semantic technologies provide the needed check points to properly index vital information in methods that imitate the way humans think, as Dr. Bush aptly noted.

Follow that with the amount of data recitation and the comments about Vannevar Bush:

In the July 1945 issue of The Atlantic Monthly, Dr. Vannevar Bush’s famous essay, “As We May Think,” was published as one of the first articles addressing Big Data, information overload, or the “growing mountain of research” as stated in the article. The 2010 IOUG Database Growth Survey, conducted in July-August 2010, estimates that more than a zettabyte (or a trillion gigabytes) of data exists in databases, and that 16 percent of organizations surveyed reported a data growth rate in excess of 50 percent annually. A Gartner survey, also conducted in July-August 2010, reported that 47 percent of IT staffers surveyed ranked data growth as one of the top three challenges faced by their IT organization. Based on two recent IBM articles derived from their CIO Survey, one in three CIOs make decisions based on untrusted data; one in two feel they do not have the data they need to make an informed decision; and 83 percent cite better analytics as a top concern. A recent survey conducted for MarkLogic asserts that 35 percent of respondents believe their unstructured data sources will surpass their structured data sources in size in the next 36 months, while 86 percent of respondents claim that unstructured data is important to their organization. The survey further asserts that only 11 percent of those that consider unstructured data important have an infrastructure that addresses unstructured data.

Dr. Bush conceptualized a “private library,” coined “memex” (mem[ory ind]ex) in his essay, which could ingest the “mountain of research,” and use associative indexing — how we think — to correlate trusted data to support human decision making. Although Dr. Bush conceptualized “memex” as a desk-based device complete with levers, buttons, and a microfilm-based storage device, he recognized that future mechanisms and gadgetry would enhance the basic concepts. The core capabilities of “memex” were needed to allow man to “encompass the great record and to grow in the wisdom of race experience.”

That would allow exploration of questions and comments like:

1) With a zettabyte of data and more coming in every day, precisely how are we going to create/impose OWL ontologies to develop “…precise models about various domains of knowledge?”

2) Curious: on what grounds is hyperlinking considered the equivalent of associative indexing? Hyperlinks can be used by indexes but hyperlinking isn't indexing. Wasn't then, isn't now.

3) The act of indexing is collecting references to a list of subjects. Imposing RDF/OWL may be a preparatory step towards indexing but it is not indexing in and of itself.

4) Description Logics are decidable but why does Steve think human knowledge can be expressed in decidable fashion? There is a vast amount of human knowledge in religion, philosophy, politics, ethics, economics, etc., that cannot be expressed in decidable fashion. Parking regulations can be expressed in decidable fashion, I think, but I don’t know if they are worth the trouble of RDF/OWL.

5) For that matter, where does Steve get the idea that human knowledge is precise? I suppose you could have made that argument in the 1890s when, except for some odd cases, classical physics was sufficient. At least until 1905. (Hint: think of Albert Einstein.) Human knowledge is always provisional, uncertain and subject to revision. CERN has apparently observed neutrinos going faster than the speed of light, for example. More revisions of physics are on the way.

Part of what we need to tame the big data “beast” is acceptance that we need information systems that are like ourselves.

That is to say information systems that are tolerant of imprecision, perhaps even inconsistency, that don’t offer a false sense of decidability and omniscience. Then at least we can talk about and recognize the parts of big data that remain to be tackled.

November 8, 2011

Jena

Filed under: Jena,RDF,RDFa — Patrick Durusau @ 7:44 pm

Jena

Did you know that Jena is incubating at Apache now?

Welcome to the Apache Jena project! Jena is a Java framework for building Semantic Web applications. Jena provides a collection of tools and Java libraries to help you to develop semantic web and linked-data apps, tools and servers.

The Jena Framework includes:

  • an API for reading, processing and writing RDF data in XML, N-triples and Turtle formats;
  • an ontology API for handling OWL and RDFS ontologies;
  • a rule-based inference engine for reasoning with RDF and OWL data sources;
  • stores to allow large numbers of RDF triples to be efficiently stored on disk;
  • a query engine compliant with the latest SPARQL specification
  • servers to allow RDF data to be published to other applications using a variety of protocols, including SPARQL

Apache Incubator Apache Jena is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator project. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Rob Weir has pointed out that since ODF (OpenDocument Format) 1.2 includes support for RDFa and RDF/XML, Jena may have a role to play in ODF's future.

You can learn more about ODF 1.2 at the OpenDocument TC.

Adding support to the ODFToolkit for RDFa/RDF and/or demonstrating the benefits of RDFa/RDF in ODF 1.2 would be most welcome!
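If you want a feel for the read/query/write loop that the Jena feature list describes, without setting up a Java project, here is a rough Python analogue using rdflib. It is not Jena and it is not tied to ODF; it just shows the shape of the workflow:

```python
# Parse a bit of RDF, run a SPARQL query over it, serialize it back out.
from rdflib import Graph

turtle = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/doc1> dc:title "An ODF document" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")        # read

query = """
SELECT ?doc ?title
WHERE { ?doc <http://purl.org/dc/elements/1.1/title> ?title }
"""
for doc, title in g.query(query):            # query
    print(doc, title)

print(g.serialize(format="nt"))              # write (N-Triples; rdflib 6+ returns a str)
```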

November 7, 2011

CumulusRDF

Filed under: Cassandra,RDF — Patrick Durusau @ 7:28 pm

CumulusRDF

From Andreas Harth and Günter Ladwig:

[W]e are happy to announce the first public release of CumulusRDF, a Linked Data server that uses Apache Cassandra [1] as a cloud-based storage backend. CumulusRDF provides a simple HTTP interface [2] to manage RDF data stored in an Apache Cassandra cluster.

Features
* By way of Apache Cassandra, CumulusRDF provides distributed, fault-tolerant and elastic RDF storage
* Supports Linked Data and triple pattern lookups
* Proxy mode: CumulusRDF can act as a proxy server [3] for other Linked Data applications, allowing to deploy any RDF dataset as Linked Data

This is a first beta release that is still somewhat rough around the edges, but the basic functionality works well. The HTTP interface is work-in-progress. Eventually, we plan to extend the storage model to support quads.

CumulusRDF is available from http://code.google.com/p/cumulusrdf/

See http://code.google.com/p/cumulusrdf/wiki/GettingStarted to get started using CumulusRDF.

There is also a paper [4] on CumulusRDF that I presented at the Scalable Semantic Knowledge Base Systems (SSWS) workshop at ISWC last week.

Cheers,
Andreas Harth and Günter Ladwig

[1] http://cassandra.apache.org/
[2] http://code.google.com/p/cumulusrdf/wiki/HttpInterface
[3] http://code.google.com/p/cumulusrdf/wiki/ProxyMode
[4] http://people.aifb.kit.edu/gla/cumulusrdf/cumulusrdf-ssws2011.pdf

Everybody knows I hate to be picky but the abstract of [4] promises:

Results on a cluster of up to 8 machines indicate that CumulusRDF is competitive to state-of-the-art distributed RDF stores.

But I didn’t see any comparison to “state-of-the-art” RDF stores, distributed or not. Did I just overlook something?

I ask because I think this approach has promise, at least as an exploration of indexing strategies for RDF and how usage scenarios may influence those strategies. But that will be difficult to evaluate in the absence of comparison to less imaginative approaches to RDF indexing.

October 14, 2011

MongoGraph – MongoDB Meets the Semantic Web

Filed under: MongoDB,RDF,Semantic Web,SPARQL — Patrick Durusau @ 6:24 pm

MongoGraph – MongoDB Meets the Semantic Web

From the post (Franz Inc.):

Recorded Webcast: MongoGraph – MongoDB Meets the Semantic Web From October 12, 2011

MongoGraph is an effort to bring the Semantic Web to MongoDB developers. We implemented a MongoDB interface to AllegroGraph to give Javascript programmers both Joins and the Semantic Web. JSON objects are automatically translated into triples and both the MongoDB query language and SPARQL work against your objects.

Join us for this webcast to learn more about working on the level of objects instead of individual triples, where an object would be defined as all the triples with the same subject. We’ll discuss the simplicity of the MongoDB interface for working with objects and all the properties of an advanced triplestore, in this case joins through SPARQL queries, automatic indexing of all attributes/values, ACID properties all packaged to deliver a simple entry into the world of the Semantic Web.

I haven’t watched the video, yet, but:

working on the level of objects instead of individual triples, where an object would be defined as all the triples with the same subject.

certainly caught my eye.

Curious if this means simply using the triples as sources of values and not "reasoning" with them?
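Whatever the answer, the grouping itself is easy to picture. A minimal sketch (mine, not AllegroGraph's MongoDB interface) of treating all triples with the same subject as one object:

```python
# Group triples by subject into JSON-ish objects; multi-valued properties stay lists.
from collections import defaultdict

triples = [
    ("ex:bob",   "foaf:name",  '"Bob"'),
    ("ex:bob",   "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:name",  '"Alice"'),
]

objects = defaultdict(lambda: defaultdict(list))
for s, p, o in triples:
    objects[s][p].append(o)

print(dict(objects["ex:bob"]))
# {'foaf:name': ['"Bob"'], 'foaf:knows': ['ex:alice']}
```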

October 5, 2011

In Defense of Ambiguity

Filed under: Ambiguity,RDF,Semantic Web — Patrick Durusau @ 6:52 pm

In Defense of Ambiguity by Patrick J. Hayes and Harry A. Halpin.

Abstract:

URIs, a universal identification scheme, are different from human names insofar as they can provide the ability to reliably access the thing identified. URIs also can function to reference a non-accessible thing in a similar manner to how names function in natural language. There are two distinctly different relationships between names and things: access and reference. To confuse the two relations leads to underlying problems with Web architecture. Reference is by nature ambiguous in any language. So any attempts by Web architecture to make reference completely unambiguous will fail on the Web. Despite popular belief otherwise, making further ontological distinctions often leads to more ambiguity, not less. Contrary to appeals to Kripke for some sort of eternal and unique identification, reference on the Web uses descriptions and therefore there is no unambiguous resolution of reference. On the Web, what is needed is not just a simple redirection, but a uniform and logically consistent manner of associating descriptions with URIs that can be done in a number of practical ways that should be made consistent.

Highly readable critique with passages such as:

There are two distinct relationships between names and things: reference and access. The architecture of the Web determines access, but has no direct influence on reference. Identifiers like URIs can be considered types of names. It is important to distinguish these two possible different relationships between a name and a thing.

1. accesses, meaning that the name provides a causal pathway to the thing, perhaps mediated by the Web.

2. refers to, meaning that the name is being used to mention the thing.

Current practice in Web Architecture uses "identifies" to mean both or either of these, apparently in the belief that they are synonyms. They are not, and to think of them as being the same is to be profoundly confused. For example, when uttering the name "Eiffel Tower" one does not in any way get magically transported to the Eiffel Tower. One can talk about it, have beliefs, plan a trip there, and otherwise have intentions about the Eiffel Tower, but the name has no causal path to the Eiffel Tower itself. In contrast, the URI http://www.tour-eiffel.fr/ offers us access to a group of Web pages via an HTTP-compliant agent. A great deal of the muddle Web architecture finds itself in can be directly traced to this confusion between access and reference.

The solution proffered by Hayes and Halpin:

Regardless of the details, the use of any technology in Web architecture to distinguish between access and reference, including our proposed ex:refersTo and ex:describedBy, does nothing more than allow the author of a URI to explain how they would like the URI to be used.

For those interested in previous recognitions of this distinction, see <resourceRef> and <subjectIndicatorRef> in XTM 1.0.

October 4, 2011

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 7:57 pm

Efficient Multidimensional Blocking for Link Discovery without losing Recall

Jack Park did due diligence on the SILK materials before I did and forwarded a link to this paper.

Abstract:

Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called Multi-Block which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property increasing the efficiency of the index significantly. In addition, it guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several 100 for large datasets compared to the full evaluation without losing recall.

From deeper in the paper:

If the similarity between two entities exceeds a threshold $\theta$, a link between these two entities is generated. $sim$ is computed by evaluating a link specification $s$ (in record linkage typically called linkage decision rule [23]) which specifies the conditions two entities must fulfill in order to be interlinked.

If I am reading this paper correctly, there isn't a requirement (as in record linkage) that we normalize the data to a common format before writing the rule for comparisons. That in and of itself is a major boon. To say nothing of the other contributions of this paper.
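To make that point concrete, here is a toy linkage rule in the spirit of the quoted sim/θ setup: two similarity measures over differently shaped records, aggregated and compared against a threshold. It is not MultiBlock (there is no blocking index here), just an illustration that the two sides need not share a schema before you write the comparison rule:

```python
# A toy linkage rule: weighted aggregation of two similarity measures.
from difflib import SequenceMatcher

def name_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_sim(a, b):
    return 1.0 if a == b else 0.0

def sim(entity_a, entity_b):
    # Each measure says which field it reads from which side,
    # so no prior normalization to a common format is needed.
    return 0.7 * name_sim(entity_a["label"], entity_b["title"]) \
         + 0.3 * year_sim(entity_a["year"],  entity_b["published"])

theta = 0.8
a = {"label": "The Semantic Web", "year": 2001}
b = {"title": "Semantic Web, The", "published": 2001}

if sim(a, b) >= theta:
    print("link generated")
```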

SILK – Link Discovery Framework Version 2.5 released

Filed under: Linked Data,LOD,RDF,Semantic Web,SPARQL — Patrick Durusau @ 7:54 pm

SILK – Link Discovery Framework Version 2.5 released

I was quite excited to see under “New Data Transformations”…”Merge Values of different inputs.”

But the documentation for Transformation must be lagging behind or I have a different understanding of what it means to “Merge Values of different inputs.”

Perhaps I should ask: What does SILK mean by “Merge Values of different inputs?”

Picking out an issue that is of particular interest to me is not meant to be a negative comment on the project. An impressive bit of work for any EU funded project.

Another question: Has anyone looked at the Silk – Link Specification Language (Silk-LSL) as an input for declaring equivalence/processing for arbitrary data objects? Just curious.

Robert Isele posted this announcement about SILK on October 3, 2011:

we are happy to announce version 2.5 of the Silk Link Discovery Framework for the Web of Data.

The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Using the declarative Silk – Link Specification Language (Silk-LSL), developers can specify the linkage rules data items must fulfill in order to be interlinked. These linkage rules may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language.

Linkage rules can either be written manually or developed using the Silk Workbench. The Silk Workbench, is a web application which guides the user through the process of interlinking different data sources.

Version 2.5 includes the following additions to the last major release 2.4:

(1) Silk Workbench now includes a function to learn linkage rules from the reference links. The learning function is based on genetic programming and capable of learning complex linkage rules. Similar to a genetic algorithm, genetic programming starts with a randomly created population of linkage rules. From that starting point, the algorithm iteratively transforms the population into a population with better linkage rules by applying a number of genetic operators. As soon as either a linkage rule with a full f-Measure has been found or a specified maximum number of iterations is reached, the algorithm stops and the user can select a linkage rule.

(2) A new sampling tab allows for fast creation of the reference link set. It can be used to bootstrap the learning algorithm by generating a number of links which are then rated by the user either as correct or incorrect. In this way positive and negative reference links are defined which in turn can be used to learn a linkage rule. If a previous learning run has already been executed, the sampling tries to generate links which contain features which are not yet covered by the current reference link set.

(3) The new help sidebar provides the user with a general description of the current tab as well as with suggestions for the next steps in the linking process. As new users are usually not familiar with the steps involved in interlinking two data sources, the help sidebar currently provides basic guidance to the user and will be extended in future versions.

(4) Introducing per-comparison thresholds:

  • On popular request, thresholds can now be specified on each comparison.
  • Backwards-compatible: Link specifications using a global threshold can still be executed.

(5) New distance measures:

  • Jaccard Similarity
  • Dice’s coefficient
  • DateTime Similarity
  • Tokenwise Similarity, contributed by Florian Kleedorfer, Research Studios Austria

(6) New data transformations:

  • RemoveEmptyValues
  • Tokenizer
  • Merge Values of multiple inputs

(7) New DataSources and Outputs

  • In addition to reading from SPARQL endpoints, Silk now also supports reading from RDF dumps in all common formats. Currently the data set is held in memory and it is not available in the Workbench yet, but future versions will improve this.
  • New SPARQL/Update Output: In addition to writing the links to a file, Silk now also supports writing directly to a triple store using SPARQL/Update.

(8) Various improvements and bugfixes

———————————————————————————

More information about the Silk Link Discovery Framework is available at:

http://www4.wiwiss.fu-berlin.de/bizer/silk/

The Silk framework is provided under the terms of the Apache License, Version 2.0 and can be downloaded from:

http://www4.wiwiss.fu-berlin.de/bizer/silk/releases/

The development of Silk was supported by Vulcan Inc. as part of its Project Halo (www.projecthalo.com) and by the EU FP7 project LOD2-Creating Knowledge out of Interlinked Data (http://lod2.eu/, Ref. No. 257943).

Thanks to Christian Becker, Michal Murawicki and Andrea Matteini for contributing to the Silk Workbench.
