Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 1, 2017

Decentralization and Linked Data: Open Review for DeSemWeb2017 at ISWC2017

Filed under: Decentralized Internet,Linked Data — Patrick Durusau @ 6:16 pm

A recent email from the organizers of DeSemWeb2017 reads:

Below are 14 contributions on the topic of decentralization and Linked Data. These were shared in reply to the call for contributions of DeSemWeb2017, an ISWC2017 workshop on Decentralizing the Semantic Web.

We invite everyone to add open reviews to any of these contributions. This ensures fair feedback and transparency of the process.

Semantic Web in the Fog of Browsers by Pascal Molli, Hala Skaf-Molli https://openreview.net/forum?id=ByFHXFy8W&noteId=ByFHXFy8W

Decentralizing the Semantic Web: Who will pay to realize it? by Tobias Grubenmann, Daniele Dell’Aglio, Abraham Bernstein, Dmitry Moor, Sven Seuken https://openreview.net/forum?id=ryrkDpyIW&noteId=ryrkDpyIW

On a Web of Data Streams by Daniele Dell’Aglio, Danh Le Phuoc, Anh Le-Tuan, Muhammad Intizar Ali, Jean-Paul Calbimonte https://openreview.net/forum?id=HyU_JWLU-&noteId=HyU_JWLU-

Towards VoIS: a Vocabulary of Interlinked Streams by Yehia Abo Sedira, Riccardo Tommasini, Emanuele Della Valle https://openreview.net/forum?id=H1ODzYPLZ&noteId=H1ODzYPLZ

Agent Server: Semantic Agent for Linked Data by Teofilo Chambilla, Claudio Gutierrez https://openreview.net/forum?id=H1aftW_Lb&noteId=H1aftW_Lb

The tripscore Linked Data client: calculating specific summaries over large time series by David Chaves Fraga, Julian Rojas, Pieter-Jan Vandenberghe, Pieter Colpaer, Oscar Corcho https://openreview.net/forum?id=H16ZExYLb&noteId=H16ZExYLb

Agreements in a De-Centralized Linked Data Based Messaging System by Florian Kleedorfer, Heiko Friedrich, Christian Huemer https://openreview.net/forum?id=B1AK_bKL-&noteId=B1AK_bKL-

Specifying and Executing User Agent Behaviour with Condition-Action Rules by Andreas Harth, Tobias Käfer https://openreview.net/forum?id=BJ67PfFLZ&noteId=BJ67PfFLZ

VisGraph^3: a web tool for RDF visualization and creation by Dominik Tomaszuk, Przemysław Truchan https://openreview.net/forum?id=rka5DGt8Z&noteId=rka5DGt8Z

Identity and Blockchain by Joachim Lohkamp, Eugeniu Rusu, Fabian Kirstein https://openreview.net/forum?id=HJ94gXtUZ&noteId=HJ94gXtUZ

LinkChains: Exploring the space of decentralised trustworthy Linked Data by Allan Third and John Domingue https://openreview.net/forum?id=HJhwZNKIb&noteId=HJhwZNKIb

Decentralizing the Persistence and Querying of RDF Datasets Through Browser-Based Technologies by Blake Regalia https://openreview.net/forum?id=B1PRiIK8-&noteId=B1PRiIK8-

Attaching Semantic Metadata to Cryptocurrency Transactions by Luis-Daniel Ibáñez, Huw Fryer, Elena Simperl https://openreview.net/forum?id=S18mSwKUZ&noteId=S18mSwKUZ

Storage Balancing in P2P Based Distributed RDF Data Stores by Maximiliano Osorio, Carlos Buil-Aranda https://openreview.net/forum?id=rJn8cDtIb&noteId=rJn8cDtIb

Full list: https://openreview.net/group?id=swsa.semanticweb.org/ISWC/2017/DeSemWeb About the workshop: http://iswc2017.desemweb.org/

You and I know that “peer review” as practiced by pay-per-view journals is nearly useless.

Here, instead of an insider group of mutually supportive colleagues, there is the potential for non-insiders to participate.

The key word is “potential.” Participation will stay merely potential unless you step up to offer a review.

Well?

Further questions?

June 1, 2017

IPLD (Interplanetary Linked Data)

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 7:33 pm

IPLD (Interplanetary Linked Data)

IPLD is the data model of the content-addressable web. It allows us to treat all hash-linked data structures as subsets of a unified information space, unifying all data models that link data with hashes as instances of IPLD.

WHY IPLD?

A data model for interoperable protocols.

Content addressing through hashes has become a widely-used means of connecting data in distributed systems, from the blockchains that run your favorite cryptocurrencies, to the commits that back your code, to the web’s content at large. Yet, whilst all of these tools rely on some common primitives, their specific underlying data structures are not interoperable.

Enter IPLD: IPLD is a single namespace for all hash-inspired protocols. Through IPLD, links can be traversed across protocols, allowing you to explore data regardless of the underlying protocol.

The webpage is annoyingly vague so you will need to visit the IPLD spec Github page and consider this whitepaper: IPFS – Content Addressed, Versioned, P2P File System (DRAFT 3) by Juan Benet.
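For a rough sense of what “hash-linked data structures” means in practice, here is a minimal Python sketch (not IPLD itself, just the underlying idea): every object is addressed by the hash of its serialized bytes, and a link is nothing more than that hash.

import hashlib, json

store = {}  # a toy content-addressed store: hash -> serialized bytes

def put(obj):
    # Serialize deterministically, hash the bytes, use the hash as the address.
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address

def get(address):
    return json.loads(store[address])

# A small hash-linked structure: the "commit" links to the "document" by hash.
doc_addr = put({"title": "hello", "body": "world"})
commit_addr = put({"message": "first commit", "links": {"doc": doc_addr}})

# Traversing a link means resolving a hash, no matter which protocol produced it.
commit = get(commit_addr)
print(get(commit["links"]["doc"])["title"])   # -> hello

IPLD’s pitch, as I read it, is that git commits, blockchain blocks and IPFS objects are all variations on the put/get pair above, so links between them can be traversed uniformly.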

As you read, can annotation of “links” avoid confusing addresses with identifiers?

We’ve seen that before and the inability to acknowledge/correct the mistake was deadly.

December 26, 2015

HOBBIT – Holistic Benchmarking of Big Linked Data

Filed under: BigData,Linked Data,Semantic Web — Patrick Durusau @ 5:07 pm

HOBBIT – Holistic Benchmarking of Big Linked Data

From the “about” page:

HOBBIT is driven by the needs of the European industry. Thus, the project objectives were derived from the needs of the European industry (represented by our industrial partners) in combination with the results of prior and ongoing efforts including BIG, BigDataEurope, LDBC Council and many more. The main objectives of HOBBIT are:

  1. Building a family of industry-relevant benchmarks,
  2. Implementing a generic evaluation platform for the Big Linked Data value chain,
  3. Providing periodic benchmarking results including diagnostics to further the improvement of BLD processing tools,
  4. (Co-)Organizing challenges and events to gather benchmarking results as well as industry-relevant KPIs and datasets,
  5. Supporting companies and academics during the creation of new challenges or the evaluation of tools.

As we found in Avoiding Big Data: More Business Intelligence Than You Would Think, 3/4 of businesses cannot extract value from data they already possess, making any investment in “big data” a sure loser for them.

Which makes me wonder about what “big data” the HOBBIT project intends to use for benchmarking “Big Linked Data?”

Then I saw on the homepage:

The HOBBIT partners such as TomTom, USU, AGT and others will provide more than 25 trillions of sensor data to be benchmarked within the HOBBIT project.

“…25 trillions of sensor data….?” sounds odd until you realize that TomTom is:

TomTom, founded in 1991, is a world leader in in-car location and navigation products.

OK, so the “Big Linked Data” in question isn’t random “linked data,” but a specialized kind of “linked data.”

That’s less risky than building a human brain with no clear idea of where to start, but it addresses a narrow window on linked data.

The HOBBIT Kickoff meeting Luxembourg 18-19 January 2016 announcement still lacks a detailed agenda.

December 18, 2015

‘Linked data can’t be your goal. Accomplish something’

Filed under: Linked Data,Marketing,Semantic Web — Patrick Durusau @ 11:08 am

Tim Strehle points to his post: Jonathan Rochkind: Linked Data Caution, which is a collection of quotes from Linked Data Caution (Jonathan Rochkind).

In the process, Tim creates his own quote, inspired by Rochkind:

‘Linked data can’t be your goal. Accomplish something’

Which is easy to generalize to:

‘***** can’t be your goal. Accomplish something’

Whether your hobby horse is linked data, graphs, noSQL, big data, or even topic maps, technological artifacts are just and only that, artifacts.

Unless and until such artifacts accomplish something, they are curios, relics venerated by pockets of the faithful.

Perhaps marketers in 2016 should be told:

Skip the potential benefits of your technology. Show me what it has accomplished (past tense) for users similar to me.

With that premise, you could weed through four or five vendors in a morning. 😉

September 30, 2015

“No Sir, Mayor Daley no longer dines here. He’s dead sir.” The Limits of Linked Data

Filed under: Linked Data,OCLC — Patrick Durusau @ 8:57 pm

That line from the Blues Brothers came to mind when I read OCLC to launch linked data pilot with seven leading libraries, which reads in part:

DUBLIN, Ohio, 11 September 2015 – OCLC is working with seven leading libraries in a pilot program designed to learn more about how linked data will influence library workflows in the future.

The Person Entity Lookup pilot will help library professionals reduce redundant data by linking related sets of person identifiers and authorities. Pilot participants will be able to surface WorldCat Person entities, including 109 million brief descriptions of authors, directors, musicians and others that have been mined from WorldCat, the world’s largest resource of library metadata.

By submitting one of a number of identifiers, such as VIAF, ISNI and LCNAF, the pilot service will respond with a WorldCat Person identifier and mappings to additional identifiers for the same person.

The pilot will begin in September and is expected to last several months. The seven participating libraries include Cornell University, Harvard University, the Library of Congress, the National Library of Medicine, the National Library of Poland, Stanford University and the University of California, Davis.
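To make the identifier-mapping idea concrete, here is a minimal sketch. Everything in it is hypothetical, identifiers included (the pilot’s API is not publicly documented); the point is only that several identifiers resolve to one person entity, with mappings back to the rest.

# Hypothetical table: (scheme, identifier) -> WorldCat Person identifier.
PERSON_MAPPINGS = {
    ("VIAF", "12345678"): "worldcat-person:0001",
    ("ISNI", "0000000000000001"): "worldcat-person:0001",
    ("LCNAF", "n00000001"): "worldcat-person:0001",
}

def lookup(scheme, value):
    """Return the WorldCat Person id plus every other identifier mapped to it."""
    person = PERSON_MAPPINGS.get((scheme, value))
    if person is None:
        return None
    others = [ident for ident, target in PERSON_MAPPINGS.items() if target == person]
    return {"person": person, "identifiers": others}

print(lookup("VIAF", "12345678"))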

If you happen to use one of the known identifiers, and your subject, like Mayor Daley, is one of the 109 million authors, directors, musicians, etc., and you are at one of these seven participants, you’re in luck!

If your subject is one of the 253 million vehicles on U.S. roads, or one of the 123.4 million people employed full time in the U.S., or one or more of the 73.9 billion credit card transactions in 2012, or one of the 3 billion cellphone calls made every day in the U.S., then linked data and the OCLC pilot project will leave you high and dry. (Feel free to add in subjects of interest to you that aren’t captured by linked data.)

It’s not a bad pilot project but it does serve to highlight the primary weakness of linked data: It doesn’t include any subjects of interest to you.

You want to talk about your employees, your products, your investments, your trades, etc.

That’s understandable. That will drive your ROI from semantic technologies.

OCLC linked data can help you with dead people and some famous ones, but that doesn’t begin to satisfy your needs.

What you need is a semantic technology that puts the fewest constraints on you and at the same time enables you to talk about your subjects, using your terms.

Interested?

June 27, 2015

Linked Data Repair and Certification

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 3:43 pm

1st International Workshop on Linked Data Repair and Certification (ReCert 2015) is a half-day workshop at the 8th International Conference on Knowledge Capture (K-CAP 2015).

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. 😉

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, up to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible of receiving a certification stamp, which together with reputation metrics of the sources can lead to having trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Monday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.
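A minimal sketch of that suggestion (illustrative only, not any particular topic map API): treat each link as a topic that carries several identifiers plus the properties a maintainer would need to decide whether two links are about the same subject.

# Each "topic" collects identifiers and matching properties for one subject.
topics = [
    {
        "identifiers": {"http://dbpedia.org/resource/Berlin",
                        "http://sws.geonames.org/2950159/"},   # illustrative identifiers
        "properties": {"type": "city", "country": "DE", "label": "Berlin"},
    },
]

def merge_if_same(topic, identifiers, properties):
    # Merge when any identifier matches; otherwise leave the topic alone.
    if topic["identifiers"] & identifiers:
        topic["identifiers"] |= identifiers
        topic["properties"].update(properties)
        return True
    return False

new_ids = {"http://dbpedia.org/resource/Berlin"}
new_props = {"population": 3500000}
# any() stops at the first matching topic; otherwise a new topic is created.
if not any(merge_if_same(t, new_ids, new_props) for t in topics):
    topics.append({"identifiers": new_ids, "properties": new_props})

Nothing exotic, but the extra properties are what make later repair decisions explainable.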

As for “trusted linked data sources,” I think the correct phrasing is: “sources less untrusted than others.”

You know the phrase: “In God we trust, all others pay cash.”

Same is true for data. It may be a “trusted” source, but verify the data first, then trust.

March 16, 2015

Grafter RDF Utensil

Filed under: Clojure,DSL,Graphs,Linked Data,RDF — Patrick Durusau @ 4:11 pm

Grafter RDF Utensil

From the homepage:

grafter

Easy Data Curation, Creation and Conversion

Grafter’s DSL makes it easy to transform tabular data from one tabular format to another. We also provide ways to translate tabular data into Linked Graph Data.

Data Formats

Grafter currently supports processing CSV and Excel files, with additional formats including Geo formats & shape files coming soon.

Separation of Concerns

Grafter’s design has a clear separation of concerns, disentangling tabular data processing from graph conversion and loading.

Incanter Interoperability

Grafter uses Incanter’s datasets, making it easy to incorporate advanced statistical processing features with your data transformation pipelines.

Stream Processing

Grafter transformations build on Clojure’s laziness, meaning you can process large amounts of data without worrying about memory.

Linked Data Templates

Describe the linked data you want to create using simple templates that look and feel like Turtle.
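Grafter itself is a Clojure DSL, but the core move, turning each table row into a handful of triples, is easy to sketch. A rough, non-Grafter illustration in Python with made-up URI patterns:

import csv, io

# A toy "pipeline": each row of a table becomes a handful of triples.
table = io.StringIO("id,name,population\n1,Keerbergen,12444\n")

def row_to_triples(row):
    subject = f"http://example.org/place/{row['id']}"   # made-up URI pattern
    yield (subject, "http://xmlns.com/foaf/0.1/name", row["name"])
    yield (subject, "http://example.org/ontology/population", row["population"])

triples = [t for row in csv.DictReader(table) for t in row_to_triples(row)]
for s, p, o in triples:
    print(s, p, o)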

Even if Grafter wasn’t a DSL, written in Clojure, producing graph output, I would have been compelled to mention it because of the cool logo!

Enjoy!

I first saw this in a tweet by ClojureWerkz.

January 7, 2015

Codelists by Statistics Belgium

Filed under: Linked Data — Patrick Durusau @ 2:41 pm

Codelists by Statistics Belgium

From the webpage:

The following codelists have been published according to the principles of linked data:

  • NIS codes, the list of alphanumeric codes for indicating administrative geographical areas as applied in statistical applications in Belgium
  • NACE 2003, the statistical Classification of Economic Activities in the European Community, Belgian adaptation, version 2003
  • NACE 2008, the statistical Classification of Economic Activities in the European Community, Belgian adaptation, version 2008

We hope to publish a mapping between NACE 2003 and NACE 2008 soon.

The data sets themselves may not be interesting but I did find the “Explore the dataset” options of interest. You can:

  • Lookup a resource by its identifier
  • Retrieve identifiers for a label (Reconciliation Service API)
  • Search by keyword

I tried Keerbergen, one of the examples under Retrieve identifiers for a label and got this result:

{"result": [
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true},
  {"id": "http://id.fedstats.be/nis/24048#id", "score": "15", "name": "Keerbergen",
   "type": ["http://www.w3.org/2004/02/skos/core#Concept"], "match": true}
]}

Amazing.

Wikipedia reports on Keerbergen as follows:

Keerbergen (Dutch pronunciation: [ˈkeːrbɛrɣə(n)]) is a municipality located in the Belgian province of Flemish Brabant. The municipality comprises only the town of Keerbergen proper. On January 1, 2006 Keerbergen had a total population of 12,444. The total area is 18.39 km² which gives a population density of 677 inhabitants per km².

I would think town or municipality is more of a concept and Keerbergen is a specific place. Yes?

Just to be sure, I also searched for Keerbergen as a keyword, same results, Keerbergen is a concept.

You should check these datasets before drawing any conclusions based upon them.
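For instance, a quick sanity check (a minimal sketch, assuming you saved the response shown above as keerbergen.json) confirms the four “matches” are one resource repeated:

import json

with open("keerbergen.json") as f:   # the reconciliation response shown above
    response = json.load(f)

ids = [entry["id"] for entry in response["result"]]
print(len(ids), "results,", len(set(ids)), "distinct resource(s)")
# -> 4 results, 1 distinct resource(s)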

I first saw this in a tweet by Paul Hermans (who lives in Keerbergen, the town).

December 20, 2014

Linked Open Data Visualization Revisited: A Survey

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 11:48 am

Linked Open Data Visualization Revisited: A Survey by Oscar Peña, Unai Aguilera and Diego López-de-Ipiña.

Abstract:

Mass adoption of the Semantic Web’s vision will not become a reality unless the benefits provided by data published under the Linked Open Data principles are understood by the majority of users. As technical and implementation details are far from being interesting for lay users, the ability of machines and algorithms to understand what the data is about should provide smarter summarisations of the available data. Visualization of Linked Open Data proposes itself as a perfect strategy to ease the access to information by all users, in order to save time learning what the dataset is about and without requiring knowledge on semantics.

This article collects previous studies from the Information Visualization and the Exploratory Data Analysis fields in order to apply the lessons learned to Linked Open Data visualization. Datatype analysis and visualization tasks proposed by Ben Shneiderman are also added in the research to cover different visualization features.

Finally, an evaluation of the current approaches is performed based on the dimensions previously exposed. The article ends with some conclusions extracted from the research.

I would like to see a version of this article after it has had several good editing passes. From the abstract alone, “…benefits provided by data…” and “…without requiring knowledge on semantics…” strike me as extremely problematic.

Data, accessible or not, does not provide benefits. The results of processing data may, which may explain the lack of enthusiasm when large data dumps are made web accessible. In and of itself, it is just another large dump of data. The results of processing that data may be very useful, but that is another step in the process.

I don’t think “…without requiring knowledge of semantics…” is in line with the rest of the article. I suspect the authors meant the semantics of data sets could be conveyed to users without their researching them prior to using the data set. I think that is problematic but it has the advantage of being plausible.

The various theories of visualization and datatypes (pages 3-8) don’t seem to advance the discussion and I would either drop that content or tie it into the actual visualization suites discussed. It’s educational but its relationship to the rest of the article is tenuous.

The coverage of visualization suites is encouraging and useful, but with an overall tighter focus, more time could be spent on each one and their entries being correspondingly longer.

Hopefully we will see a later, edited version of this paper as a good summary/guide to visualization tools for linked data would be a useful addition to the literature.

I first saw this in a tweet by Marin Dimitrov.

November 11, 2014

Discovering Patterns for Cyber Defense Using Linked Data Analysis [12th Nov., 10am PDT]

Filed under: Cybersecurity,Hadoop,Hortonworks,Linked Data — Patrick Durusau @ 5:22 pm

Discovering Patterns for Cyber Defense Using Linked Data Analysis

Wednesday, Nov. 12th | 10am PDT

I am always suspicious of one-day announcements of webinars. This post appeared on November 11th for a webinar on November 12th.

Only one way to find out so I registered. Join me to find out: substantive presentation or click-bait.

If enough people attend and then comment here, one way or the other, who knows? It might make a difference.

From the post:

Almost every week, news of a proprietary or customer data breach hits the news wave. While attackers have increased the level of sophistication in their tactics, so too have organizations advanced in their ability to build a robust, data-driven defense.

Apache Hadoop has emerged as the de facto big data platform, which makes it the perfect fit to accumulate cybersecurity data and diagnose the latest attacks.  As Enterprises roll out and grow their Hadoop implementations, they require effective ways for pinpointing and reasoning about correlated events within their data, and assessing their network security posture.

Join Hortonworks and Sqrrl to learn:

  • How Linked Data Analysis enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data
  • Effective ways to correlate events within your data and assess your network security posture
  • New techniques for discovering hidden patterns and detecting anomalies within your data
  • How Hadoop fits into your current data structure forming a secure, Modern Data Architecture

Register now to learn how combining the power of Hadoop and the Hortonworks Data Platform with massive, secure, entity-centric data models in Sqrrl Enterprise allows you to create a data-driven defense.

Bring your red pen. November 12, 2014 at 10am PDT. (That should be 1pm East Coast time.) See you then!

Linked Data Integration with Conflicts

Filed under: Data Integration,Integration,Linked Data — Patrick Durusau @ 4:40 pm

Linked Data Integration with Conflicts by Jan Michelfeit, Tomáš Knap, Martin Nečaský.

Abstract:

Linked Data have emerged as a successful publication format and one of its main strengths is its fitness for integration of data from multiple sources. This gives them a great potential both for semantic applications and the enterprise environment where data integration is crucial. Linked Data integration poses new challenges, however, and new algorithms and tools covering all steps of the integration process need to be developed. This paper explores Linked Data integration and its specifics. We focus on data fusion and conflict resolution: two novel algorithms for Linked Data fusion with provenance tracking and quality assessment of fused data are proposed. The algorithms are implemented as part of the ODCleanStore framework and evaluated on real Linked Open Data.

Conflicts in Linked Data? The authors explain:

From the paper:

The contribution of this paper covers the data fusion phase with conflict resolution and a conflict-aware quality assessment of fused data. We present new algorithms that are implemented in ODCleanStore and are also available as a standalone tool ODCS-FusionTool.2

Data fusion is the step where actual data merging happens – multiple records representing the same real-world object are combined into a single, consistent, and clean representation [3]. In order to fulfill this definition, we need to establish a representation of a record, purge uncertain or low-quality values, and resolve identity and other conflicts. Therefore we regard conflict resolution as a subtask of data fusion.

Conflicts in data emerge during the data fusion phase and can be classified as schema, identity, and data conflicts. Schema conflicts are caused by different source data schemata – different attribute names, data representations (e.g., one or two attributes for name and surname), or semantics (e.g., units). Identity conflicts are a result of different identifiers used for the same real-world objects. Finally, data conflicts occur when different conflicting values exist for an attribute of one object.

Conflict can be resolved on entity or attribute level by a resolution function. Resolution functions can be classified as deciding functions, which can only choose values from the input such as the maximum value, or mediating functions, which may produce new values such as average or sum [3].

Oh, so the semantic diversity of data simply flowed into Linked Data representation.

Hmmm, watch for a basis in the data for resolving schema, identity and data conflicts.

The related work section is particularly rich with references to non-Linked Data conflict resolution projects. Definitely worth a close read and chasing the references.

To examine the data fusion and conflict resolution algorithm the authors start by restating the problem:

  1. Different identifying URIs are used to represent the same real-world entities.
  2. Different schemata are used to describe data.
  3. Data conflicts emerge when RDF triples sharing the same subject and predicate have inconsistent values in place of the object.

I am skipping all the notation manipulation for the quads, etc., mostly because of the inputs into the algorithm:

[Figure from the paper: example inputs to the fusion algorithm (conflicting triples with already-mapped URIs, source metadata weights, and location values)]

As a result of human intervention, the different identifying URIs have been mapped together. Not to mention the weighting of the metadata and the desired resolution for data conflicts (location data).

With that intervention, the complex RDF notation and manipulation becomes irrelevant.

Moreover, as I am sure you are aware, there is more than one “Berlin” listed in DBpedia. Several dozen as I recall.

I mention that because the process as described does not say where the authors of the rules/mappings obtained the information necessary to distinguish one Berlin from another.

That information is critical if another author is to evaluate the correctness of their mappings.

At the end of the day, after the “resolution” proposed by the authors, we are in no better position to map their result to another than we were at the outset. We have bald statements with no additional data on which to evaluate those statements.

Give Appendix A, “List of Conflict Resolution Functions,” a close read. The authors have extracted conflict resolution functions from the literature. Should be a time saver as well as suggestive of other needed resolution functions.
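To make the deciding/mediating distinction concrete, here is a minimal sketch (not the ODCS-FusionTool API, just the idea) applied to conflicting values for one subject/predicate pair; the sources are made up:

# Conflicting object values for the same subject and predicate,
# each tagged with its source so provenance can feed a quality score.
conflicting = [
    {"value": 3460000, "source": "http://example.org/sourceA"},
    {"value": 3500000, "source": "http://example.org/sourceB"},
    {"value": 3420000, "source": "http://example.org/sourceC"},
]

# A "deciding" function may only pick one of the input values.
def resolve_max(values):
    return max(v["value"] for v in values)

# A "mediating" function may produce a new value, such as an average.
def resolve_avg(values):
    return sum(v["value"] for v in values) / len(values)

print(resolve_max(conflicting))   # -> 3500000
print(resolve_avg(conflicting))   # -> 3460000.0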

PS: If you look for ODCS-FusionTool you will find LD-Fusion Tool (GitHub), which was renamed to ODCS-FusionTool a year ago. See also the official LD-FusionTool webpage.

October 31, 2014

DBpedia now available as triple pattern fragments

Filed under: DBpedia,Linked Data — Patrick Durusau @ 2:01 pm

DBpedia now available as triple pattern fragments by Ruben Verborgh.

From the post:

DBpedia is perhaps the most widely known Linked Data source on the Web. You can use DBpedia in a variety of ways: by querying the SPARQL endpoint, by browsing Linked Data documents, or by downloading one of the data dumps. Access to all of these data sources is offered free of charge.

Last week, a fourth way of accessing DBpedia became publicly available: DBpedia’s triple pattern fragments at http://fragments.dbpedia.org/. This interface offers a different balance of trade-offs: it maximizes the availability of DBpedia by offering a simple server and thus moving SPARQL query execution to the client side. Queries will execute slower than on the public SPARQL endpoint, but their execution should be possible close to 100% of the time.

Here are some fun things to try:
– browse the new interface: http://fragments.dbpedia.org/2014/en?object=dbpedia%3ALinked_Data
– make your browser execute a SPARQL query: http://fragments.dbpedia.org/
– add live queries to your application: https://github.com/LinkedDataFragments/Client.js#using-the-library

Learn all about triple pattern fragments at the Linked Data Fragments website http://linkeddatafragments.org/, the ISWC2014 paper http://linkeddatafragments.org/publications/iswc2014.pdf,
and ISWC2014 slides: http://www.slideshare.net/RubenVerborgh/querying-datasets-on-the-web-with-high-availability.

A new effort to achieve robust processing of triples.
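A minimal sketch of what a client request looks like, assuming the 2014 fragments interface linked above still accepts subject/predicate/object query parameters and can return Turtle (details may have changed since):

import urllib.parse, urllib.request

# Ask for one page of every triple with a given object.
params = urllib.parse.urlencode({"object": "dbpedia:Linked_Data"})
url = "http://fragments.dbpedia.org/2014/en?" + params

request = urllib.request.Request(url, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    page = response.read().decode("utf-8")

# A fragment holds matching triples plus metadata and paging controls;
# a real client follows the "next page" link and joins patterns itself.
print(page[:500])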

Enjoy!

Enhancing open data with identifiers

Filed under: Linked Data,Open Data — Patrick Durusau @ 12:07 pm

Enhancing open data with identifiers

From the webpage:

The Open Data Institute and Thomson Reuters have published a new white paper, explaining how to use identifiers to create extra value in open data.

Identifiers are at the heart of how data becomes linked. It’s a subject that is fundamentally important to the open data community, and to the evolution of the web itself. However, identifiers are also in relatively early stages of adoption, and not many are aware of what they are.
Put simply, identifiers are labels used to refer to an object being discussed or exchanged, such as products, companies or people. The foundation of the web is formed by connections that hold pieces of information together. Identifiers are the anchors that facilitate those links.

This white paper, ‘Creating value with identifiers in an open data world’ is a joint effort between Thomson Reuters and the Open Data Institute. It is written as a guide to identifier schemes:

  • why identity can be difficult to manage;
  • why it is important for open data;
  • what challenges there are today and recommendations for the community to address these in the future.

Illustrative examples of identifier schemes are used to explain these points.

The recommendations are based on specific issues found to occur across different datasets, and should be relevant for anyone using, publishing or handling open data, closed data and/or their own proprietary data sets.

Are you a data consumer?
Learn how identifiers can help you create value from discovering and connecting to other sources of data that add relevant context.

Are you a data publisher?
Learn how understanding and engaging with identifier schemes can reduce your costs, and help you manage complexity.

Are you an identifier publisher?
Learn how open licensing can grow the open data commons and bring you extra value by increasing the use of your identifier scheme.

The design and use of successful identifier schemes requires a mix of social, data and technical engineering. We hope that this white paper will act as a starting point for discussion about how identifiers can and will create value by empowering linked data.

Read the blog post on Linked data and the future of the web, from Chief Enterprise Architect for Thomson Reuters, Dave Weller.

When citing this white paper, please use the following text: Open Data Institute and Thomson Reuters, 2014, Creating Value with Identifiers in an Open Data World, retrieved from thomsonreuters.com/site/data-identifiers/

Creating Value with Identifiers in an Open Data World [full paper]

Creating Value with Identifiers in an Open Data World [management summary]

From the paper:

The coordination of identity is thus not just an inherent component of dataset design, but should be acknowledged as a distinct discipline in its own right.

A great presentation on identity and management of identifiers, echoing many of the themes discussed in topic maps.

A must read!
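For the impatient, the basic point in miniature (records and identifiers made up): once two datasets share an identifier scheme, joining them is trivial; matching on name strings alone is not.

# Two datasets describing the same company under different name strings.
filings = [{"id": "lei:EXAMPLE000000000001", "name": "Acme Corporation", "revenue": "1.2B"}]
news    = [{"id": "lei:EXAMPLE000000000001", "name": "ACME Corp.", "headline": "Results announced"}]

# Joining on the name string fails; joining on the shared identifier does not.
by_id = {record["id"]: record for record in filings}
for item in news:
    match = by_id.get(item["id"])
    if match:
        print(match["name"], "|", match["revenue"], "|", item["headline"])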

Next week I will begin a series of posts on the individual issues identified in this white paper.

I first saw this in a tweet by Bob DuCharme.

October 20, 2014

LSD Dimensions

Filed under: Linked Data,RDF,RDF Data Cube Vocabulary,Statistics — Patrick Durusau @ 7:50 pm

LSD Dimensions

From the about page: http://lsd-dimensions.org/dimensions

LSD Dimensions is an observatory of the current usage of dimensions and codes in Linked Statistical Data (LSD).

LSD Dimensions is an aggregator of all qb:DimensionProperty resources (and their associated triples), as defined in the RDF Data Cube vocabulary (W3C recommendation for publishing statistical data on the Web), that can be currently found in the Linked Data Cloud (read: the SPARQL endpoints in Datahub.io). Its purpose is to improve the reusability of statistical dimensions, codes and concept schemes in the Web of Data, providing an interface for users (future work: also for programs) to search for resources commonly used to describe open statistical datasets.

Usage

The main view shows the count of queried SPARQL endpoints and the number of retrieved dimensions, together with a table that displays these dimensions.

  • Sorting. Dimensions can be sorted by their dimension URI, label and number of references (i.e. number of times a dimension is used in the endpoints) by clicking on the column headers.
  • Pagination. The number of rows per page can be customized and browsed by clicking at the bottom selectors.
  • Search. String-based search can be performed by writing the search query in the top search field.

Any of these dimensions can be further explored by clicking at the eye icon on the left. The dimension detail view shows

  • Endpoints. The endpoints that make use of that dimension.
  • Codes. Popular codes that are defined (future work: also assigned) as valid values for that dimension.

Motivation

RDF Data Cube (QB) has boosted the publication of Linked Statistical Data (LSD) as Linked Open Data (LOD) by providing a means “to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts”. QB defines cubes as sets of observations affected by dimensions, measures and attributes. For example, the observation “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years” has three dimensions (time period, with value 2004-2006; region, with value Newport; and sex, with value male), a measure (population life expectancy) and two attributes (the units of measure, years; and the metadata status, measured, to make explicit that the observation was measured instead of, for instance, estimated or interpolated). In some cases, it is useful to also define codes, a closed set of values taken by a dimension (e.g. sensible codes for the dimension sex could be male and female).

There is a vast diversity of domains to publish LSD about, and quite some dimensions and codes can be very heterogeneous, domain specific and hardly comparable. To this end, QB allows users to mint their own URIs to create arbitrary dimensions and associated codes. Conversely, some other dimensions and codes are quite common in statistics, and could be easily reused. However, publishers of LSD have no means to monitor the dimensions and codes currently used in other datasets published in QB as LOD, and consequently they cannot (a) link to them; nor (b) reuse them.

This is the motivation behind LSD Dimensions: it monitors the usage of existing dimensions and codes in LSD. It allows users to browse, search and gain insight into these dimensions and codes. We depict the diversity of statistical variables in LOD, improving their reusability.

(Emphasis added.)

The highlighted text:

There is a vast diversity of domains to publish LSD about, and quite some dimensions and codes can be very heterogeneous, domain specific and hardly comparable.

is the key, isn’t it? If you can’t rely on data titles, users must examine the data and determine which sets can or should be compared.

The question then is: how do you capture the information such users developed in making those decisions and pass it on to following users? Or do you just allow following users to make their own way afresh?

If you document the additional information for each data set, say by using a topic map, each use of this resource becomes richer for the following users. Richer, or it stays the same. Your call.
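For reference, the harvesting LSD Dimensions does is conceptually simple. A sketch along these lines (SPARQL carried in Python; the endpoint is hypothetical and parameter names vary by server) lists dimension properties and how often data structure definitions use them:

import json, urllib.parse, urllib.request

# Count how often each qb:DimensionProperty is referenced by a structure definition.
query = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?dimension (COUNT(?dsd) AS ?uses)
WHERE {
  ?dimension a qb:DimensionProperty .
  ?dsd qb:component/qb:dimension ?dimension .
}
GROUP BY ?dimension
ORDER BY DESC(?uses)
"""

endpoint = "http://example.org/sparql"   # hypothetical endpoint
url = endpoint + "?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["dimension"]["value"], binding["uses"]["value"])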

I first saw this in a tweet by Bob DuCharme, who remarked that LSD Dimensions has a great title!

If you have made it this far, you realize that with all the data set, RDF and statistical language this isn’t the post you were looking for. 😉

PS: Yes Bob, it is a great title!

October 16, 2014

COLD 2014 Consuming Linked Data

Filed under: Linked Data — Patrick Durusau @ 6:38 pm

COLD 2014 Consuming Linked Data

Table of Contents

You can get an early start on your weekend reading now! 😉

October 15, 2014

How To Build Linked Data APIs…

Filed under: Linked Data,RDF,Schema.org,Semantic Web,Uncategorized — Patrick Durusau @ 7:23 pm

This is the second high signal-to-noise presentation I have seen this week! I am sure that streak won’t last but I will enjoy it as long as it does.

Resources for after you see the presentation: Hydra: Hypermedia-Driven Web APIs, JSON for Linking Data, and, JSON-LD 1.0.

Near the end of the presentation, Marcus quotes Phil Archer, W3C Data Activity Lead:

[Image: slide quoting Phil Archer on the Semantic Web]

Which is an odd statement, considering that JSON-LD 1.0, Section 7 (Data Model), reads in part:

JSON-LD is a serialization format for Linked Data based on JSON. It is therefore important to distinguish between the syntax, which is defined by JSON in [RFC4627], and the data model which is an extension of the RDF data model [RDF11-CONCEPTS]. The precise details of how JSON-LD relates to the RDF data model are given in section 9. Relationship to RDF.

And section 9. Relationship to RDF reads in part:

JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize Generalized RDF Datasets. The JSON-LD extensions to the RDF data model are:…

Is JSON-LD “…a concrete RDF syntax…” where you can ignore RDF?

Not that I was ever a fan of RDF but standards should be fish or fowl and not attempt to be something in between.
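Still, the dual nature is easy to see in code. Assuming the PyLD library (a JSON-LD processor for Python; the exact API may differ across versions), the same document can be read as plain JSON or serialized as RDF:

from pyld import jsonld   # assumes the PyLD package is installed

doc = {
    "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
    "@id": "http://example.org/people/alice",
    "name": "Alice",
}

# Treated as plain JSON, it is just a dict with a "name" key.
print(doc["name"])

# Treated as Linked Data, the same document yields RDF statements.
print(jsonld.to_rdf(doc, {"format": "application/n-quads"}))
# -> <http://example.org/people/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .

Whether ignoring the second reading counts as “ignoring RDF” is exactly the question above.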

August 28, 2014

Linked Data Platform Best Practices…

Filed under: Linked Data,Standards — Patrick Durusau @ 1:34 pm

Linked Data Platform Best Practices and Guidelines Note Published

From the post:

The Linked Data Platform (LDP) Working Group has published a Group Note of Linked Data Platform Best Practices and Guidelines. This document provides best practices and guidelines for implementing Linked Data Platform servers and clients. Learn more about the Data Activity.

The document takes pains to distinguish “best practice” from “guidance:”

For the purposes of this document, it is useful to make a minor, yet important distinction between the term ‘best practice’ and the term ‘guideline’. We define and differentiate the terms as follows:

best practice
An implementation practice (method or technique) that has consistently shown results superior to those achieved with other means and that is used as a benchmark. Best practices within this document apply specifically to the ways that LDP servers and clients are implemented as well as how certain resources are prepared and used with them. In this document, the best practices might be used as a kind of check-list against which an implementer can directly evaluate a system’s design and code. Lack of adherence to any given best practice, however, does not necessarily imply a lack of quality; they are recommendations that are said to be ‘best’ in most cases and in most contexts, but not all. A best practice is always subject to improvement as we learn and evolve the Web together.
guideline
A tip, a trick, a note, a suggestion, or answer to a frequently asked question. Guidelines within this document provide useful information that can advance an implementer’s knowledge and understanding, but that may not be directly applicable to an implementation or recognized by consensus as a ‘best practice’.

Personally I don’t see the distinction as useful but I bring it to your attention in case you are reading or authoring in this space.

August 20, 2014

Exposing Resources in Datomic…

Filed under: Datomic,Linked Data,RDF — Patrick Durusau @ 2:39 pm

Exposing Resources in Datomic Using Linked Data by Ratan Sebastian.

From the post:

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data, and providing access to it for our application is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems

  1. Adding a new data set to make data visualizations with. This one was a high-dimensional data set and we were certain that the queries that would be needed to make the charts had to be very parameterizable.

  2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts given that we were serving up the data using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog or have to set up a Scala console to look at the raw data that was being served up.

  3. Different data sets that we use had semantically equivalent data that was being accessed in ways specific to that data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are very orthogonal goals to be sure. We embarked on a project which we thought might move us in those three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol. And what’s more accessible than REST. Goal 1 meant that we’d have to expose quite a bit of Datalog expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform. A W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

If you are happy with Datomic and RDF primitives, for semantic purposes, this may be all you need.

You have to appreciate Ratan’s closing sentiments:

We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closeted world of handling financial data.

Even though we know as a practical matter that no “shared ontology of financial data” is likely to emerge.

In the absence of such a shared ontology, there are always topic maps.

July 30, 2014

JudaicaLink released

Filed under: Encyclopedia,History,Humanities,Linked Data — Patrick Durusau @ 3:07 pm

JudaicaLink released

From the post:

Data extractions from two encyclopediae from the domain of Jewish culture and history have been released as Linked Open Data within our JudaicaLink project.

JudaicaLink now provides access to 22,808 concepts in English (~ 10%) and Russian (~ 90%), mostly locations and persons.

See here for further information: http://www.judaicalink.org/blog/kai-eckert/encyclopedia-russian-jewry-released-updates-yivo-encyclopedia

Next steps in this project include “…the creation of links between the two encyclopedias and links to external sources like DBpedia or Geonames.”

In case you are interested, the two encyclopedias are:

The YIVO Encyclopedia of Jews in Eastern Europe, courtesy of the YIVO Institute of Jewish Research, NY.

Rujen.ru provides an Internet version of the Encyclopedia of Russian Jewry, which has been published in Moscow since 1994, giving a comprehensive, objective picture of the life and activity of the Jews of Russia, the Soviet Union and the CIS.

For more details: Encyclopediae

If you are looking to contribute content or time to a humanities project, this should be on your short list.

July 25, 2014

Linked Data 2011 – 2014

Filed under: Linked Data,LOD — Patrick Durusau @ 10:45 am

One of the more well known visualizations of the Linked Data Cloud has been updated to 2014. For comparison purposes, I have included the 2011 version as well.

LOD Cloud 2011


“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

LOD Cloud 2014


From Adoption of Linked Data Best Practices in Different Topical Domains by Max Schmachtenberg, Heiko Paulheim, and Christian Bizer.

How would you characterize the differences between the two?

Partially a question of how to use large graph displays? Even though the originals (in some versions) are interactive, how often is an overview of related linked data sets required?

I first saw this in a tweet by Juan Sequeda.

July 23, 2014

Choke-Point based Benchmark Design

Filed under: Benchmarks,Graphs,Linked Data — Patrick Durusau @ 7:05 pm

Choke-Point based Benchmark Design by Peter Boncz.

From the post:

The Linked Data Benchmark Council (LDBC) mission is to design and maintain benchmarks for graph data management systems, and establish and enforce standards in running these benchmarks, and publish and arbitrate around the official benchmark results. The council and its ldbcouncil.org website just launched, and in its first 1.5 year of existence, most effort at LDBC has gone into investigating the needs of the field through interaction with the LDBC Technical User Community (next TUC meeting will be on October 5 in Athens) and indeed in designing benchmarks.

So, what makes a good benchmark design? Many talented people have paved our way in addressing this question and for relational database systems specifically the benchmarks produced by TPC have been very helpful in maturing relational database technology, and making it successful. Good benchmarks are relevant and representative (address important challenges encountered in practice), understandable, economical (implementable on simple hardware), fair (such as not to favor a particular product or approach), scalable, accepted by the community and public (e.g. all of its software is available in open source). This list stems from Jim Gray’s Benchmark Handbook. In this blogpost, I will share some thoughts on each of these aspects of good benchmark design.

Just in case you want to start preparing for the Athens meeting:

The Social Network Benchmark 0.1 draft and supplemental materials.

The Semantic Publishing Benchmark 0.1 draft and supplemental materials.

Take the opportunity to download the benchmark materials edited by Jim Gray. They will be useful in evaluating the LDBC benchmarks.

July 15, 2014

RDFUnit

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 4:04 pm

RDFUnit – an RDF Unit-Testing suite

From the post:

RDFUnit is a test driven data-debugging framework that can run automatically generated (based on a schema) and manually generated test cases against an endpoint. All test cases are executed as SPARQL queries using a pattern-based transformation approach.

For more information on our methodology please refer to our report:

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web.

RDFUnit in a Nutshell

  • Test case: a data constraint that involves one or more triples. We use SPARQL as a test definition language.
  • Test suite: a set of test cases for testing a dataset
  • Status: Success, Fail, Timeout (complexity) or Error (e.g. network). A Fail can be an actual error, a warning or a notice
  • Data Quality Test Pattern (DQTP): Abstract test cases that can be instantiated into concrete test cases using pattern bindings
  • Pattern Bindings: valid replacements for a DQTP variable
  • Test Auto Generators (TAGs): Converts RDFS/OWL axioms into concrete test cases
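The pattern-based approach is easy to see in miniature. Here is a sketch (not RDFUnit’s own code; the endpoint and class are placeholders) of a test pattern instantiated with one binding and run as a SPARQL query, where any rows returned count as violations:

import json, urllib.parse, urllib.request

# A miniature Data Quality Test Pattern with one variable ({cls}).
PATTERN = """
SELECT ?resource WHERE {{
  ?resource a <{cls}> .
  FILTER NOT EXISTS {{ ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label }}
}} LIMIT 10
"""

def run_test(endpoint, cls):
    """Instantiate the pattern for one class and report resources missing a label."""
    query = PATTERN.format(cls=cls)
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(url) as response:
        rows = json.load(response)["results"]["bindings"]
    status = "Fail" if rows else "Success"
    return status, [row["resource"]["value"] for row in rows]

print(run_test("http://example.org/sparql", "http://xmlns.com/foaf/0.1/Person"))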

If you are working with RDF data, this will certainly be helpful.

BTW, don’t miss the publications further down on the homepage for RDFUnit.

I first saw this in a tweet by Marin Dimitrov.

Linked Data Guidelines (Australia)

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 3:30 pm

First Version of Guidelines for Publishing Linked Data released by Allan Barger.

From the post:

The Australian Government Linked Data Working group (AGLDWG) is pleased to announce the release of a first version of a set of guidelines for the publishing of Linked Datasets on data.gov.au at:

https://github.com/AGLDWG/TR/wiki/URI-Guidelines-for-publishing-linked-datasets-on-data.gov.au-v0.1

The “URI Guidelines for publishing Linked Datasets on data.gov.au” document provides a set of general guidelines aimed at helping Australian Government agencies to define and manage URIs for Linked Datasets and the resources described within that are published on data.gov.au. The Australian Government Linked Data Working group has developed the report over the last two years while the first datasets under the environment.data.gov.au sub-domain have been published following the patterns defined in this document.

Thought you might find this useful in mapping linked data sets from the Australian government to:

  • non-Australian government linked data sets
  • non-government linked data sets
  • non-linked data data sets (all sources)
  • pre-linked data data sets (all sources)
  • post-linked data data sets (all sources)

Enjoy!

July 11, 2014

LDBC benchmarks reach Public Draft

Filed under: Benchmarks,Linked Data — Patrick Durusau @ 4:12 pm

LDBC benchmarks reach Public Draft

From the post:

The Linked Data Benchmark Council (LDBC) is reaching a milestone today, June 23 2014, in announcing that two of the benchmarks that it has been developing for 1.5 years have now reached the status of Public Draft. This concerns the Semantic Publishing Benchmark (SPB) and the interactive workload of the Social Network Benchmark (SNB). In case of LDBC, the release is staged: now the benchmark software just runs read-only queries. This will be expanded in a few weeks with a mix of read- and insert-queries. Also, query validation will be added later. Watch this blog for the announcements to come, as this will be a matter of weeks to add.

The Public Draft stage means that the initial software (data generator, query driver) work and an initial technical specification and documentation has been written. In other words, there is a testable version of the benchmark available for anyone who is interested. Public Draft status does not mean that the benchmark has been adopted yet, it rather means that LDBC has come closer to adopting them, but is now soliciting feedback from the users. The benchmarks will remain in this stage at least until October 6. On that date, LDBC is organizing its fifth Technical User Community meeting. One of the themes for that meeting is collecting user feedback on the Public Drafts; which input will be used to either further evolve the benchmarks, or adopt them.

You can also see that we created a this new website and a new logo. This website is different from http://ldbc.eu that describes the EU project which kick-starts LDBC. The ldbcouncil.org is a website maintained by the Linked Data Benchmark Council legal entity, which will live on after the EU project stops (in less than a year). The Linked Data Benchmark Council is an independent, impartial, member-sustained organization dedicated to the creation of RDF and graph data management benchmarks and benchmark practices.

What do you expect with an announcement of a public review draft?

A link to the public review draft?

If so, you are out of luck with the new Linked Data Benchmark Council website. Nice looking website, poor on content.

Let me help out:

The Social Network Benchmark 0.1 draft and supplemental materials.

The Semantic Publishing Benchmark 0.1 draft and supplemental materials.

Pointing readers to drafts makes it easier for them to submit comments. These drafts will remain open for comments “at least until October 6” according to the post.

At which time they will be further evolved or adopted. I suggest you review them and get your comments in early.

June 25, 2014

Covering the European Elections with Linked Data

Filed under: EU,Linked Data,Politics — Patrick Durusau @ 7:15 pm

Covering the European Elections with Linked Data by Basile Simon.

From the post:

What we wanted to do was:

  • to use Linked Data in a news context (something that the Vote2014 team was trying to do with Paul’s new model, article above),
  • to provide some background on this important event for the UK and Europe,
  • and to offer alternative coverage of the election (sort of).

In the end, we built an experimental dashboard for the elections, and eventually discovered some potentially editorially challenging stuff in our data—detailed below—which led us to decide not to release the experiment to the public. Despite being unable to release the project, this one or two weeks rush taught us lots, and we are today coming up with improvements to our data model, following the questions raised by our findings. Before we get to the findings, though, I’ll walk through the process of making the dashboard.

If you are thinking about covering the U.S. mid-term elections this Fall, you need to read Basile’s post.

Not only will you be inspired in many ways but you will gain insight into what it will take to have a quality interface ready by election time. It is a non-trivial task but apparently a very exciting one.

Perhaps you can provide an alternative to the mind numbing stalling until enough results are in for the elections to be called.

June 16, 2014

JSON-LD for software discovery…

Filed under: JSON,Linked Data,RDF,Semantic Web — Patrick Durusau @ 3:43 pm

JSON-LD for software discovery, reuse and credit by Arfon Smith.

From the post:

JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

{ "name" : "Arfon" }

when there’s an entity called name you know that it means the name of a person and not a place.

If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API providers (and consumers) to share data more easily with little or no ambiguity about the data they’re describing.

The YouTube video “What is JSON-LD?” by Manu Sporny makes an interesting point about the “ambiguity problem,” that is, do you mean by “name” what I mean by “name” as a property?

At about time mark 5:36, Manu addresses the “ambiguity problem.”

The resolution of the ambiguity is to use a hyperlink as an identifier, the implication being that if we use the same identifier, we are talking about the same thing. (That isn’t true in real life, cf. the many meanings of owl:sameAs, but for simplicity’s sake, let’s leave that to one side.)

OK, what is the difference between both of us using the string “name” and both of us using the string “http://ex.com/name”? Both of them are opaque strings that either match or don’t. This just kicks the semantic can a little bit further down the road.

Let me use a better example from json-ld.org:

{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}

If you follow http://json-ld.org/contexts/person.jsonld you will obtain a 2.4k JSON-LD file that contains (in part):

"Person": "http://xmlns.com/foaf/0.1/Person"

Following that link results in a webpage that reads in part:

The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.

and it is said to be:

Disjoint With: Project, Organization

Ambiguity jumps back to the fore with: Something is a Person if it is a person.

What is that, solipsism? Tautology?

There is no opportunity to say what properties are necessary to qualify as a “person” in the sense defined by FOAF.

You may think that is nit-picking but without the ability to designate properties required to be a “person,” it isn’t possible to talk about 42 U.S.C. § 1983 civil rights actions, where municipalities are held to be “persons” within the meaning of that law. That’s just one example. There are numerous variations on “person” for legal purposes.

You could argue that JSON-LD is for superficial or bubble-gum semantics, but it is too useful a syntax for that fate.

Rather, I would like to see JSON-LD make ambiguity “manageable” for its users. True, you could define a “you know what I mean” document like FOAF, if that suits your purposes. On the other hand, you should be able to define required key/value pairs for any subject, and to extend an existing definition for any key or value.
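JSON-LD itself has no notion of a required key, so what follows is only a hypothetical sketch of what user-manageable ambiguity could look like: the consumer keeps its own definitions of “person” and checks incoming records against whichever definition matters for its purpose. All keys and names here are made up for illustration.

REQUIRED_KEYS = {
    # A FOAF-style person needs very little.
    "foaf-person": {"name"},
    # A legal "person" for a 42 U.S.C. § 1983 claim needs different properties.
    "legal-person": {"name", "jurisdiction", "legal-capacity"},
}

def qualifies_as(record: dict, definition: str) -> bool:
    """True if the record carries every key the chosen definition requires."""
    return REQUIRED_KEYS[definition].issubset(record.keys())

record = {"name": "John Lennon", "born": "1940-10-09"}
print(qualifies_as(record, "foaf-person"))   # True
print(qualifies_as(record, "legal-person"))  # False: no jurisdiction or legal-capacity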

How far you need to go is decided on a case-by-case basis. For apps that display “AI” by tracking you and pushing more ads your way, FOAF may well be sufficient. For those of us with non-advertising-driven interests, other diversions may await.

May 3, 2014

OCLC releases WorldCat Works as linked data

Filed under: Linked Data,OCLC,WorldCat — Patrick Durusau @ 7:08 pm

OCLC releases WorldCat Works as linked data

From the press release:

OCLC has made 197 million bibliographic work descriptions—WorldCat Works—available as linked data, a format native to the Web that will improve discovery of library collections through a variety of popular sites and Web services.

Release of this data marks another step toward providing interconnected linked data views of WorldCat. By making this linked data available, library collections can be exposed to the wider Web community, integrating these collections and making them more easily discoverable through websites and services that library users visit daily, such as Google, Wikipedia and social networks.

“Bibliographic data stored in traditional record formats has reached its limits of efficiency and utility,” said Richard Wallis, OCLC Technology Evangelist. “New technologies, influenced by the Web, now enable us to move toward managing WorldCat data as entities—such as ‘Works,’ ‘People,’ ‘Places’ and more—as part of the global Web of data.”

OCLC has created authoritative work descriptions for bibliographic resources found in WorldCat, bringing together multiple manifestations of a work into one logical authoritative entity. The release of “WorldCat Works” is the first step in providing linked data views of rich WorldCat entities. Other WorldCat descriptive entities will be created and released over time.

If you are looking for a smallish set of entity identifiers, this is a good start on bibliographic materials.

I say smallish because as of 2009, there were 672 million assigned phone numbers in the United States (Numbering Resource Utilization in the United States).

Each of those phone numbers has the potential to identify some subject, the assigned number if nothing else, although other uses suggest themselves.
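If you want to experiment with the Work identifiers themselves, they are meant to be dereferenced over HTTP with content negotiation. A rough sketch, assuming the Python requests library; the Work number below is invented for illustration, and the RDF serializations actually on offer should be checked against OCLC’s documentation:

import requests

# Invented Work number, for illustration only.
work_uri = "http://worldcat.org/entity/work/id/1234567890"

response = requests.get(
    work_uri,
    headers={"Accept": "text/turtle"},  # ask for an RDF serialization
    allow_redirects=True,               # the identifier redirects to the data view
    timeout=30,
)
response.raise_for_status()
print(response.text[:500])  # start of the Turtle description of the Work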

April 20, 2014

Annotating, Extracting, and Linking Legal Information

Filed under: Annotation,Extraction,Law,Law - Sources,Legal Informatics,Linked Data — Patrick Durusau @ 3:59 pm

Annotating, Extracting, and Linking Legal Information by Adam Wyner. (slides)

Great slides, provided you have enough background in the area to fill in the gaps.

I first saw this at: Wyner: Annotating, Extracting, and Linking Legal Information, which has collected up the links/resources mentioned in the slides.

Despite decades of electronic efforts and several centuries of manual effort before that, legal information retrieval remains an open challenge.

March 12, 2014

Towards Web-scale Web querying [WWW vs. Internet]

Filed under: Linked Data,SPARQL — Patrick Durusau @ 4:17 pm

Towards Web-scale Web querying: The quest for intelligent clients starts with simple servers that scale. by Ruben Verborgh.

From the post:

Most public SPARQL endpoints are down for more than a day per month. This makes it impossible to query public datasets reliably, let alone build applications on top of them. It’s not a performance issue, but an inherent architectural problem: any server offering resources with an unbounded computation time poses a severe scalability threat. The current Semantic Web solution to querying simply doesn’t scale. The past few months, we’ve been working on a different model of query solving on the Web. Instead of trying to solve everything at the server side—which we can never do reliably—we should build our servers in such a way that enables clients to solve queries efficiently.

The Web of Data is filled with an immense amount of information, but what good is that if we cannot efficiently access those bits of information we need?

SPARQL endpoints aim to fulfill the promise of querying on the Web, but their notoriously low availability rates make that impossible. In particular, if you want high availability for your SPARQL endpoint, you have to compromise one of these:

  • offering public access,
  • allowing unrestricted queries,
  • serving many users.

Any SPARQL endpoint that tries to fulfill all of those inevitably has low availability. Low availability means unreliable query access to datasets. Unreliable access means we cannot build applications on top of public datasets.

Sure, you could just download a data dump and have your own endpoint, but then you move from Web querying to local querying, and that problem has been solved ages ago. Besides, it doesn’t give you access to up to date information, and who has enough storage to download a dump of the entire Web?

The whole “endpoint” concept will never work on a Web scale, because servers are subject to arbitrarily complex requests by arbitrarily many clients. (emphasis in original)

The prelude to an interesting proposal on Linked Data Fragments.

See the Linked Data Fragments website or Web-Scale Querying through Linked Data Fragments by Ruben Verborgh, et al. (LDOW2014 workshop).
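The server interface behind the proposal is deliberately simple: a client asks for every triple matching a single pattern and pages through the results, doing the query planning itself. A rough sketch, assuming the Python requests library; the endpoint URL and parameter names are as I recall them from the Linked Data Fragments project’s public DBpedia interface, so verify them before relying on this:

import requests

endpoint = "http://fragments.dbpedia.org/2014/en"

# Ask for every triple whose predicate is rdf:type and whose object is dbo:Band.
params = {
    "subject": "",
    "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "object": "http://dbpedia.org/ontology/Band",
}

response = requests.get(endpoint, params=params,
                        headers={"Accept": "text/turtle"}, timeout=30)
response.raise_for_status()

# One page of matching triples comes back, plus metadata (an estimated total
# count) and hypermedia controls pointing to the next page.
print(response.text[:500])

The design choice shows in the request: because the server only ever answers single triple patterns, its cost per request is bounded, and the client shoulders the work of combining fragments into answers for full SPARQL queries.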

The paper gives a primary motivation as:

There is one issue: it appears to be very hard to make a SPARQL endpoint available reliably. A recent survey examining 427 public endpoints concluded that only one third of them have an availability rate above 99%; not even half of all endpoints reach 95% [6]. To put this into perspective: 95% availability means the server is unavailable for one and a half days every month. These figures are quite disturbing given the fact that availability is usually measured in “number of nines” [5, 25], counting the number of leading nines in the availability percentage. In comparison, the fairly common three nines (99.9%) amounts to 8.8 hours of downtime per year. The disappointingly low availability of public SPARQL endpoints is the Semantic Web community’s very own “Inconvenient Truth”.
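The downtime arithmetic quoted there is easy to check:

# Quick check of the availability figures quoted above.
hours_per_month = 30 * 24          # 720
hours_per_year = 365 * 24          # 8760

downtime_95 = (1 - 0.95) * hours_per_month    # 36 hours, i.e. 1.5 days per month
downtime_999 = (1 - 0.999) * hours_per_year   # 8.76 hours, the quoted "8.8 hours" per year

print(downtime_95 / 24, "days per month at 95% availability")
print(downtime_999, "hours per year at 99.9% availability")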

Curious that on the twenty-fifth anniversary of the WWW I would realize the WWW re-created a networking problem solved by the Internet.

Unlike the WWW, to say nothing of Linked Data and its cousins in the SW activity, the Internet doesn’t have a single point of failure.

Or put more positively, the Internet is fault-tolerant by design. In contrast, the SW is fragile, by design.

While I applaud the Linked Data Fragments exploration of the solution space, focusing on the design flaw of a single point of failure might be more profitable.

I first saw this in a tweet by Thomas Steiner.

March 7, 2014

Trapping Users with Linked Data (WorldCat)

Filed under: Linked Data,WorldCat — Patrick Durusau @ 5:33 pm

WorldCat Works Linked Data – Some Answers To Early Questions by Richard Wallis.

The most interesting question Richard answers:

Q: Is there a bulk download available?

No, there is no bulk download available. This is a deliberate decision for several reasons.

Firstly, this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.

I will give you one guess as to who decides on the entities, identifiers and relationships to be maintained.

Hint: It’s not you.

Which in my view is one of the principal weaknesses of Linked Data.

In order to participate, you have to forfeit your right to organize your world differently from the way it has been organized by Richard Wallis, WorldCat and others.

I am sure they all have good intentions, and WorldCat will come close enough for most of my purposes, but I’m not interested in a one-world view, whoever agrees with it. Even me.

If you are good with graphics, take the original Apple “1984” commercial and reverse it.

Show users and a screen of vivid diversity, then show a Richard Wallis look-alike touching the side of the projection screen as the uniform grayness of linked data starts to spread across it. As it does, the users in the audience, who have been in traditional dress, start to look like the audience at the start of Apple’s 1984 commercial.

That’s the intellectual landscape that linked data promises. Do you really want to go there?

Nothing against standards; I have helped write one or two of them. But I do oppose uniformity for the sake of empowering self-appointed guardians.

Particularly when that uniformity is a tepid grey that doesn’t reflect the rich and discordant hues of human intellectual history.
