Archive for the ‘RDF’ Category

Linked Data Repair and Certification

Saturday, June 27th, 2015

1st International Workshop on Linked Data Repair and Certification (ReCert 2015) is a half-day workshop at the 8th International Conference on Knowledge Capture (K-CAP 2015).

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. 😉

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, up to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible of receiving a certification stamp, which together with reputation metrics of the sources can lead to having trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Wednesday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.

As for “trusted linked data sources,” I think the correct phrasing is: “less untrusted data sources than others.”

You know the phrase: “In God we trust, all others pay cash.”

The same is true for data. It may be a “trusted” source, but verify the data first, then trust.

SPARQL in 11 minutes (Bob DuCharme)

Monday, May 4th, 2015

From the description:

An introduction to the W3C query language for RDF. See http://www.learningsparql.com for more.

I first saw this in Bob DuCharme’s post: SPARQL: the video.

Nothing new for old hands but useful to pass on to newcomers.

I say nothing new, but I did learn that Bob has a Korg Monotron synthesizer. Looking forward to more “accompanied” blog posts. 😉

Grafter RDF Utensil

Monday, March 16th, 2015

Grafter RDF Utensil

From the homepage:

[Image: Grafter logo]

Easy Data Curation, Creation and Conversion

Grafter’s DSL makes it easy to transform tabular data from one tabular format to another. We also provide ways to translate tabular data into Linked Graph Data.

Data Formats

Grafter currently supports processing CSV and Excel files, with additional formats including Geo formats & shape files coming soon.

Separation of Concerns

Grafter’s design has a clear separation of concerns, disentangling tabular data processing from graph conversion and loading.

Incanter Interoperability

Grafter uses Incanter’s datasets, making it easy to incorporate advanced statistical processing features with your data transformation pipelines.

Stream Processing

Grafter transformations build on Clojure’s laziness, meaning you can process large amounts of data without worrying about memory.

Linked Data Templates

Describe the linked data you want to create using simple templates that look and feel like Turtle.
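To make the "Stream Processing" point concrete, here is a minimal sketch of the same idea in Python rather than Clojure. It is not Grafter's API: the function names, the sample data and the `http://example.org/` base URI are all assumptions made up for illustration. Each stage is a generator, so a row is only read when a downstream consumer asks for it.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical names throughout; this only illustrates lazy, staged processing.

def read_rows(handle: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one CSV row at a time; nothing is read until a consumer asks."""
    yield from csv.DictReader(handle)

def keep_columns(rows: Iterable[Dict[str, str]],
                 columns: List[str]) -> Iterator[Dict[str, str]]:
    """Tabular step: project each row onto the requested columns."""
    for row in rows:
        yield {name: row[name] for name in columns}

def to_triples(rows: Iterable[Dict[str, str]],
               base_uri: str) -> Iterator[Tuple[str, str, str]]:
    """Graph step: turn every non-id cell into a (subject, predicate, object)."""
    for row in rows:
        subject = f"{base_uri}{row['id']}"
        for name, value in row.items():
            if name != "id":
                yield (subject, f"{base_uri}def/{name}", value)

SAMPLE = "id,name,city,notes\n1,Alice,Leeds,x\n2,Bob,York,y\n"

pipeline = to_triples(
    keep_columns(read_rows(io.StringIO(SAMPLE)), ["id", "name", "city"]),
    "http://example.org/",
)
first = next(pipeline)  # only the first data row has been parsed at this point
print(first)
```

Keeping the tabular step (`keep_columns`) apart from the graph step (`to_triples`) mirrors the separation of concerns the homepage describes, although the real DSL will look quite different.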

Even if Grafter wasn’t a DSL, written in Clojure, producing graph output, I would have been compelled to mention it because of the cool logo!

Enjoy!

I first saw this in a tweet by ClojureWerkz.

KDE and The Semantic Desktop

Saturday, March 14th, 2015

KDE and The Semantic Desktop by Vishesh Handa.

From the post:

During the KDE4 years the Semantic Desktop was one of the main pillars of KDE. Nepomuk was a massive, all encompassing, and integrated with many different part of KDE. However few people know what The Semantic Desktop was all about, and where KDE is heading.

History

The Semantic Desktop as it was originally envisioned comprised of both the technology and the philosophy behind The Semantic Web.

The Semantic Web is built on top of RDF and Graphs. This is a special way of storing data which focuses more on understanding what the data represents. This was primarily done by carefully annotating what everything means, starting with the definition of a resource, a property, a class, a thing, etc.

This process of all data being stored as RDF, having a central store, with applications respecting the store and following the ontologies was central to the idea of the Semantic Desktop.

The Semantic Desktop cannot exist without RDF. It is, for all intents and purposes, what the term “semantic” implies.

A brief post-mortem on the KDE Semantic Desktop which relied upon NEPOMUK (Networked Environment for Personal, Ontology-based Management of Unified Knowledge) for RDF-based features. (NEPOMUK was an EU project.)

The post mentions complexity more than once. A friend recently observed that RDF was all about supporting AI and not capturing arbitrary statements by a user.

Such statements would include alternative identifiers for subjects. With enough alternative identifications (including context, which “scope” partially captures in topic maps), I suspect a deep learning application could do pretty well at subject recognition, including appropriate relationships (associations).

But that would not be by trying to guess or formulate formal rules (a la RDF/OWL) but by capturing the activities of users as they provide alternative identifications of and relationships for subjects.

Hmmm, merging then would be a learned behavior by our applications. Will have to give that some serious thought!

I first saw this in a tweet by Stefano Bertolo.

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service

Sunday, March 8th, 2015

Blazegraph™ Selected by Wikimedia Foundation to Power the Wikidata Query Service by Brad Bebee.

From the post:

Blazegraph™ has been selected by the Wikimedia Foundation to be the graph database platform for the Wikidata Query Service. Read the Wikidata announcement here. Blazegraph™ was chosen over Titan, Neo4j, Graph-X, and others by Wikimedia in their evaluation. There’s a spreadsheet link in the selection message, which has quite an interesting comparison of graph database platforms.

Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others. The Wikidata Query Service is a new capability being developed to allow users to be able to query and curate the knowledge base contained in Wikidata.

We’re super-psyched to be working with Wikidata and think it will be a great thing for Wikidata and Blazegraph™.

From the Blazegraph™ SourceForge page:

Blazegraph™ is SYSTAP’s flagship graph database. It is specifically designed to support big graphs offering both Semantic Web (RDF/SPARQL) and Graph Database (tinkerpop, blueprints, vertex-centric) APIs. It is built on the same open source GPLv2 platform and maintains 100% binary and API compatibility with Bigdata®. It features robust, scalable, fault-tolerant, enterprise-class storage and query and high-availability with online backup, failover and self-healing. It is in production use with enterprises such as Autodesk, EMC, Yahoo7!, and many others. Blazegraph™ provides both embedded and standalone modes of operation.

Blazegraph has a High Availability and Scale Out architecture. It provides robust support for Semantic Web (RDF/SPARQL) and Property Graph (Tinkerpop) APIs. Highly scalable Blazegraph graph can handle 50 Billion edges.

See the Blazegraph wiki, which has forty-three (43) substantive links to further details on Blazegraph.

For an even deeper look, consider these white papers:

Enjoy!

SPARQLES: Monitoring Public SPARQL Endpoints

Sunday, February 15th, 2015

SPARQLES: Monitoring Public SPARQL Endpoints by Pierre-Yves Vandenbussche, Jürgen Umbrich, Aidan Hogan, and Carlos Buil-Aranda.

Abstract:

We describe SPARQLES: an online system that monitors the health of public SPARQL endpoints on the Web by probing them with custom-designed queries at regular intervals. We present the architecture of SPARQLES and the variety of analytics that it runs over public SPARQL endpoints, categorised by availability, discoverability, performance and interoperability. To motivate the system, we give examples of some key questions about the health and maturation of public SPARQL endpoints that can be answered by the data it has collected in the past year(s). We also detail the interfaces that the system provides for human and software agents to learn more about the recent history and current state of an individual SPARQL endpoint or about overall trends concerning the maturity of all endpoints monitored by the system.

I started to pass on this article since it does date from 2009 but am now glad that I didn’t. The service is still active and can be found at: http://sparqles.okfn.org/.

The discoverability of SPARQL endpoints is reported to be:

[Image: sparql-discovery]

From the article:

[VoID Description:] The Vocabulary of Interlinked Data-sets (VoID) [2] has become the de facto standard for describing RDF datasets (in RDF). The vocabulary allows for specifying, e.g., an OpenSearch description, the number of triples a dataset contains, the number of unique subjects, a list of properties and classes used, number of triples associated with each property (used as predicate), number of instances of a given class, number of triples used to describe all instances of a given class, predicates used to describe class instances, and so forth. Likewise, the description of the dataset is often enriched using external vocabulary, such as for licensing information.

[SD Description:] Endpoint capabilities – such as supported SPARQL version, query and update features, I/O formats, custom functions, and/or entailment regimes – can be described in RDF using the SPARQL 1.1 Service Description (SD) vocabulary, which became a W3C Recommendation in March 2013 [21]. Such descriptions, if made widely available, could help a client find public endpoints that support the features it needs (e.g., find SPARQL 1.1 endpoints)

No, I’m not calling your attention to this to pick on SPARQL, especially, but the lack of discoverability raises a serious issue for any information retrieval system that hopes to do better than dumb luck searching.

Clearly SPARQL has the capability to increase discoverability. Whether those mechanisms would be effective or not cannot be answered due to lack of use. So my first question is: Why aren’t the mechanisms of SPARQL being used to increase discoverability?

Or perhaps better, having gone to the trouble to construct a SPARQL endpoint, why aren’t people taking the next step to make it more discoverable?

Is it because discoverability benefits some remote and faceless user instead of those being called upon to make the endpoint more discoverable? In that sense, is it a lack of positive feedback for the person tasked with increasing discoverability?

I ask because if we can’t find the key to motivating people to increase the discoverability of information (SPARQL or no) then we are in serious trouble as the volume of big data continues to increase. The amount of data will continue to grow and discoverability will continue to go down. That can’t be a happy circumstance for anyone interested in discovering information.

Suggestions?

I first saw this in a tweet by Ruben Verborgh.

RDF Stream Processing Workshop at ESWC2015

Saturday, February 7th, 2015

RDF Stream Processing Workshop at ESWC2015

May 31st, 2015 in Portoroz, Slovenia

Important dates:

Submission for EoI: Friday March 6, 2015
Notification of acceptance: Friday April 3, 2015
Workshop days: Sunday May 31, 2015

From the webpage:

Motivation

Data streams are an increasingly prevalent source of information in a wide range of domains and applications, e.g. environmental monitoring, disaster response, or smart cities. The RDF model is based on a traditional persisted-data paradigm, where the focus is on maintaining a bounded set of data items in a knowledge base. This paradigm does not fit the case of data streams, where data items flow continuously over time, forming unbounded sequences of data. In this context, the W3C RDF Stream Processing (RSP) Community Group has taken the task to explore the existing technical and theoretical proposals that incorporate streams to the RDF model, and to its query language, SPARQL. More concretely, one of the main goals of the RSP Group is to define a common, but extensible core model for RDF stream processing. This core model can serve as a starting point for RSP engines to be able to talk to each other and interoperate.

Goal

The goal of this workshop is to bring together interested members of the community to:

  • Demonstrate their latest advances in stream processing systems for RDF.
  • Foster discussion for agreeing on a core model and query language for RDF streams.
  • Involve and attract people from related research areas to actively participate in the RSP Community Group.

Each of these objectives will intensify interest and participation in the community to ultimately broaden its impact and allow for going towards a standardization process. As a result of this workshop the authors will contribute to the W3C RSP Community Group Report that will be published as part of the group activities.
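The mismatch described under "Motivation" (a bounded set of triples versus an unbounded sequence) is usually bridged by evaluating over windows of the stream. What follows is only a rough sketch of a time-based sliding window over timestamped triples. It is not the RSP Community Group's model, and the type names, the sample readings and the ten-second width are assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
TimedTriple = Tuple[float, Triple]  # (timestamp in seconds, triple)

def sliding_windows(stream: Iterable[TimedTriple],
                    width: float) -> Iterator[List[Triple]]:
    """After each arrival, yield the triples seen in the last `width` seconds.

    Assumes timestamps arrive in non-decreasing order. The window covers
    the half-open interval (now - width, now].
    """
    window: Deque[TimedTriple] = deque()
    for timestamp, triple in stream:
        window.append((timestamp, triple))
        # Evict everything that has aged out of the window.
        while window and window[0][0] <= timestamp - width:
            window.popleft()
        yield [t for _, t in window]

# Hypothetical sensor readings; the URIs are placeholders.
READINGS: List[TimedTriple] = [
    (0.0, ("http://example.org/sensor1", "http://example.org/temp", "20.1")),
    (5.0, ("http://example.org/sensor1", "http://example.org/temp", "20.4")),
    (12.0, ("http://example.org/sensor1", "http://example.org/temp", "21.0")),
]

for contents in sliding_windows(READINGS, width=10.0):
    print(len(contents), "triple(s) in window")
```

Each yielded list is a small, bounded RDF graph, which is the part an ordinary SPARQL-style query could be run against.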

As the world of technology continues to evolve and RDF does not, you have to admire the persistence of the RDF community in bolting RDF onto every new technical innovation.

I never thought the problem with RDF was technological. No, rather the problem was: Why should I use your identifiers and relationships when I much prefer my own? Which include an implied basis I used to assign each identifier to a subject. The “implied” part being how we came to have multiple meanings for owl:sameAs. If I can’t see the “implied” part, I cannot agree or disagree with it.

CSV on the Web:… [ .csv 5,250,000, .rdf 72,700]

Thursday, January 8th, 2015

CSV on the Web: Metadata Vocabulary for Tabular Data, and Their Conversion to JSON and RDF

From the post:

The CSV on the Web Working Group has published First Public Working Drafts of the Generating JSON from Tabular Data on the Web and the Generating RDF from Tabular Data on the Web documents, and has also issued new releases of the Metadata Vocabulary for Tabular Data and the Model for Tabular Data and Metadata on the Web Working Drafts. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. Validation, conversion, display, and search of that tabular data requires additional information on that data. The “Metadata vocabulary” document defines a vocabulary for metadata that annotates tabular data, providing such information as datatypes, linkage among different tables, license information, or human readable description of columns. The standard conversion of the tabular data to JSON and/or RDF makes use of that metadata to provide representations of the data for various applications. All these technologies rely on a basic data model for tabular data described in the “Model” document. The Working Group welcomes comments on these documents and on their motivating use cases. Learn more about the Data Activity.

These are working drafts and as such have a number of issues noted in the text of each one. Excellent opportunity to participate in the W3C process.
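To see why the extra metadata matters, consider a minimal sketch of a metadata-driven CSV to JSON conversion. This is not the Working Group's conversion algorithm, and the `METADATA` dictionary below is a simplified, made-up shape (column names plus datatypes) rather than the actual metadata vocabulary. The sample rows are invented as well.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Hypothetical, simplified metadata: one entry per column with a datatype.
METADATA: Dict[str, Any] = {
    "columns": [
        {"name": "country", "datatype": "string"},
        {"name": "year", "datatype": "integer"},
        {"name": "population", "datatype": "integer"},
    ]
}

# How each (assumed) datatype name maps onto a Python conversion.
CASTS = {"string": str, "integer": int, "number": float}

def csv_to_records(text: str, metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert CSV text to typed records, using the metadata for datatypes."""
    casts = {col["name"]: CASTS[col["datatype"]] for col in metadata["columns"]}
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Columns the metadata does not mention are left as plain strings.
        records.append({name: casts.get(name, str)(value)
                        for name, value in row.items()})
    return records

SAMPLE = "country,year,population\nAD,2012,78360\nAF,2012,29824536\n"

records = csv_to_records(SAMPLE, METADATA)
print(json.dumps(records, indent=2))
```

Without the metadata every cell is just a string. With it, `year` and `population` come out as JSON numbers, which is the kind of information the CSV file alone cannot carry.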

There aren’t any reliable numbers but searching for “.csv” returns 5,250,000 “hits” and searching on “.rdf” returns 72,700 “hits.”

That sounds really low for CSV and doesn’t include all the CSV files on local systems.

Still, I would say that CSV files continue to be important and that this work merits your attention.

Review of Large-Scale RDF Data Processing in MapReduce

Wednesday, January 7th, 2015

Review of Large-Scale RDF Data Processing in MapReduce by Ke Hou, Jing Zhang and Xing Fang.

Abstract:

Resource Description Framework (RDF) is an important data presenting standard of semantic web and how to process, the increasing RDF data is a key problem for development of semantic web. MapReduce is a widely-used parallel programming model which can provide a solution to large-scale RDF data processing. This study reviews the recent literatures on RDF data processing in MapReduce framework in aspects of the forward-chaining reasoning, the simple querying and the storage mode determined by the related querying method. Finally, it is proposed that the future research direction of RDF data processing should aim at the scalable, increasing and complex RDF data query.

I count twenty-nine (29) projects with two- to three-sentence summaries of each one. Great starting point for an in-depth review of RDF data processing using MapReduce.
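For readers new to the programming model, here is a toy, single-process sketch of the map, shuffle and reduce phases applied to RDF triples, counting how often each predicate is used. It is not taken from any of the reviewed projects. The function names and sample triples are assumptions, and a real job would run the phases across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]

def map_phase(triple: Triple) -> Iterator[Tuple[str, int]]:
    """Map: emit (predicate, 1) for every triple."""
    _subject, predicate, _obj = triple
    yield (predicate, 1)

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group the emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts for one predicate."""
    return key, sum(values)

def count_predicates(triples: Iterable[Triple]) -> Dict[str, int]:
    pairs = (pair for triple in triples for pair in map_phase(triple))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

# Made-up sample data; the URIs are placeholders.
TRIPLES: List[Triple] = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/b", "http://example.org/knows", "http://example.org/c"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
]

print(count_predicates(TRIPLES))
```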

I first saw this in a tweet by Marin Dimitrov.

MarkLogic® 8…

Tuesday, November 18th, 2014

MarkLogic® 8 Evolves Database Technology to Solve Heterogeneous Data Integration Problems with the Power of Search, Semantics and Bitemporal Features All in One System

From the post:

MarkLogic Corporation, the leading Enterprise NoSQL database platform provider, today announced the availability of MarkLogic® Version 8 Early Access Edition. MarkLogic 8 brings together advanced search, semantics, bitemporal and native JavaScript support into one powerful, agile and trusted database platform. Companies can now:

  • Get better answers faster through integrated search and query of all of their data, metadata, and relationships, regardless of the data type or source;
  • Lower costs and increase agility by easily integrating heterogeneous data, including relational, unstructured, and richly structured data, across silos and at massive scale;
  • Rapidly build production-ready applications in weeks versus months or years to address the needs of the business or organization.

For enterprise customers who value agility but can’t compromise on resiliency, MarkLogic software is the only database platform that integrates Google-like search with rich query and semantics into an intelligent and extensible data layer that works equally well in a data center or in the cloud. Unlike other NoSQL solutions, MarkLogic provides ACID transactions, HA, DR, and other hardened features that enterprises require, along with the scalability and agility they need to accelerate their business.

“As more complex data, much of it semi-structured, becomes increasingly important to businesses’ daily operations, enterprises are realizing that they must look beyond relational databases to help them understand, integrate, and manage all of their data, deriving maximum value in a simple, yet sophisticated manner,” said Carl Olofson, research vice president at IDC. “MarkLogic has a history of bringing advanced data management technology to market and many of their customers and partners are accustomed to managing complex data in an agile manner. As a result, they have a more mature and creative view of how to manage and use data than do mainstream database users. MarkLogic 8 offers some very advanced tools and capabilities, which could expand the market’s definition of enterprise database technology.”

I’m not in the early release program but if you are, heads up!

By “semantics,” MarkLogic means RDF triples and the ability to query those triples with text, values, etc.

Since we can all see triples, text and values with different semantics, your semantic mileage with MarkLogic may vary greatly.

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle

Tuesday, November 11th, 2014

clj-turtle: A Clojure Domain Specific Language (DSL) for RDF/Turtle by Frédérick Giasson.

From the post:

Some of my recent work leaded me to heavily use Clojure to develop all kind of new capabilities for Structured Dynamics. The ones that knows us, knows that every we do is related to RDF and OWL ontologies. All this work with Clojure is no exception.

Recently, while developing a Domain Specific Language (DSL) for using the Open Semantic Framework (OSF) web service endpoints, I did some research to try to find some kind of simple Clojure DSL that I could use to generate RDF data (in any well-known serialization). After some time, I figured out that no such a thing was currently existing in the Clojure ecosystem, so I choose to create my simple DSL for creating RDF data.

The primary goal of this new project was to have a DSL that users could use to created RDF data that could be feed to the OSF web services endpoints such as the CRUD: Create or CRUD: Update endpoints.

What I choose to do is to create a new project called clj-turtle that generates RDF/Turtle code from Clojure code. The Turtle code that is produced by this DSL is currently quite verbose. This means that all the URIs are extended, that the triple quotes are used and that the triples are fully described.

This new DSL is mean to be a really simple and easy way to create RDF data. It could even be used by non-Clojure coder to create RDF/Turtle compatible data using the DSL. New services could easily be created that takes the DSL code as input and output the RDF/Turtle code. That way, no Clojure environment would be required to use the DSL for generating RDF data.
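What "quite verbose" Turtle looks like is easy to show. The sketch below is in Python, not Clojure, and it is not clj-turtle's API. The helper names are invented and the example URIs are placeholders. It only illustrates the three properties mentioned in the post: full URIs with no prefixes, triple-quoted literals, and every triple written out in full.

```python
def iri(uri: str) -> str:
    """Write a URI in full, with no prefix abbreviation."""
    return f"<{uri}>"

def literal(value: str) -> str:
    """Write a triple-quoted (long) Turtle literal with minimal escaping."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"""{escaped}"""'

def triple(subject: str, predicate: str, obj: str,
           is_literal: bool = False) -> str:
    """Write one fully described triple: subject, predicate, object, dot."""
    rendered = literal(obj) if is_literal else iri(obj)
    return f"{iri(subject)} {iri(predicate)} {rendered} ."

# Placeholder data for illustration only.
lines = [
    triple("http://example.org/bob",
           "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
           "http://xmlns.com/foaf/0.1/Person"),
    triple("http://example.org/bob",
           "http://xmlns.com/foaf/0.1/name",
           "Bob", is_literal=True),
]
print("\n".join(lines))
```

Verbose output of this kind is longer than hand-written Turtle, but it needs no prefix table and parses the same way.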

I mention Frédérick’s DSL for RDF despite my doubts about RDF. Good or not, RDF has achieved the status of legacy data.

Analyzing Schema.org

Thursday, October 23rd, 2014

Analyzing Schema.org by Peter F. Patel-Schneider.

Abstract:

Schema.org is a way to add machine-understandable information to web pages that is processed by the major search engines to improve search performance. The definition of schema.org is provided as a set of web pages plus a partial mapping into RDF triples with unusual properties, and is incomplete in a number of places. This analysis of and formal semantics for schema.org provides a complete basis for a plausible version of what schema.org should be.

Peter’s analysis is summarized when he says:

The lack of a complete definition of schema.org limits the possibility of extracting the correct information from web pages that have schema.org markup.

Ah, yes, “…the correct information from web pages….”

I suspect the lack of semantic precision has powered the success of schema.org. Each user of schema.org markup has their private notion of the meaning of their use of the markup and there is no formal definition to disabuse them of that notion. Not that formal definitions were enough to save owl:sameAs from varying interpretations.

Schema.org empowers varying interpretations without requiring users to ignore OWL or description logic.

For the domains that schema.org covers, eateries, movies, bars, whore houses, etc., the semantic slippage permitted by schema.org lowers the bar to usage of its markup. Which has resulted in its adoption more widely than other proposals.

The lesson of schema.org is the degree of semantic slippage you can tolerate depends upon your domain. For pharmaceuticals, I would assume that degree of slippage is as close to zero as possible. For movie reviews, not so much.

Any effort to impose the same degree of semantic slippage across all domains is doomed to failure.

I first saw this in a tweet by Bob DuCharme.

LSD Dimensions

Monday, October 20th, 2014

LSD Dimensions

From the about page: http://lsd-dimensions.org/dimensions

LSD Dimensions is an observatory of the current usage of dimensions and codes in Linked Statistical Data (LSD).

LSD Dimensions is an aggregator of all qb:DimensionProperty resources (and their associated triples), as defined in the RDF Data Cube vocabulary (W3C recommendation for publishing statistical data on the Web), that can be currently found in the Linked Data Cloud (read: the SPARQL endpoints in Datahub.io). Its purpose is to improve the reusability of statistical dimensions, codes and concept schemes in the Web of Data, providing an interface for users (future work: also for programs) to search for resources commonly used to describe open statistical datasets.

Usage

The main view shows the count of queried SPARQL endpoints and the number of retrieved dimensions, together with a table that displays these dimensions.

  • Sorting. Dimensions can be sorted by their dimension URI, label and number of references (i.e. number of times a dimension is used in the endpoints) by clicking on the column headers.
  • Pagination. The number of rows per page can be customized and browsed by clicking at the bottom selectors.
  • Search. String-based search can be performed by writing the search query in the top search field.

Any of these dimensions can be further explored by clicking at the eye icon on the left. The dimension detail view shows

  • Endpoints. The endpoints that make use of that dimension.
  • Codes. Popular codes that are defined (future work: also assigned) as valid values for that dimension.

Motivation

RDF Data Cube (QB) has boosted the publication of Linked Statistical Data (LSD) as Linked Open Data (LOD) by providing a means “to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts”. QB defines cubes as sets of observations affected by dimensions, measures and attributes. For example, the observation “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years” has three dimensions (time period, with value 2004-2006; region, with value Newport; and sex, with value male), a measure (population life expectancy) and two attributes (the units of measure, years; and the metadata status, measured, to make explicit that the observation was measured instead of, for instance, estimated or interpolated). In some cases, it is useful to also define codes, a closed set of values taken by a dimension (e.g. sensible codes for the dimension sex could be male and female).
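The life expectancy observation above can be sketched as a small data structure. This is a plain Python illustration, not the RDF Data Cube vocabulary itself: the class, the field names and the key names are assumptions, and in QB each of them would be a URI.

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass(frozen=True)
class Observation:
    """One cube observation: dimensions locate it, measures hold the value,
    attributes qualify how the value should be read."""
    dimensions: Dict[str, str]
    measures: Dict[str, float]
    attributes: Dict[str, str]

# The example from the text, with made-up key names.
obs = Observation(
    dimensions={"time_period": "2004-2006", "region": "Newport", "sex": "male"},
    measures={"life_expectancy": 76.7},
    attributes={"unit": "years", "status": "measured"},
)

# A code list is a closed set of values that a dimension may take.
SEX_CODES: Set[str] = {"male", "female"}

def valid_code(observation: Observation, dimension: str, codes: Set[str]) -> bool:
    """Check that the observation's value for a dimension is in the code list."""
    return observation.dimensions[dimension] in codes

print(valid_code(obs, "sex", SEX_CODES))
```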

There is a vast diversity of domains to publish LSD about, and quite some dimensions and codes can be very heterogeneous, domain specific and hardly comparable. To this end, QB allows users to mint their own URIs to create arbitrary dimensions and associated codes. Conversely, some other dimensions and codes are quite common in statistics, and could be easily reused. However, publishers of LSD have no means to monitor the dimensions and codes currently used in other datasets published in QB as LOD, and consequently they cannot (a) link to them; nor (b) reuse them.

This is the motivation behind LSD Dimensions: it monitors the usage of existing dimensions and codes in LSD. It allows users to browse, search and gain insight into these dimensions and codes. We depict the diversity of statistical variables in LOD, improving their reusability.

(Emphasis added.)

The highlighted text:

There is a vast diversity of domains to publish LSD about, and quite some dimensions and codes can be very heterogeneous, domain specific and hardly comparable.

is the key, isn’t it? If you can’t rely on data titles, users must examine the data and determine which sets can or should be compared.

The question then is how do you capture the information such users developed in making those decisions and pass it on to following users? Or do you just allow following users to make their own way afresh?

If you document the additional information for each data set, by using a topic map, each use of this resource becomes richer for the following users. Richer or stays the same. Your call.

I first saw this in a tweet by Bob DuCharme. Who remarked this organization has a great title!

If you have made it this far, you realize that with all the data set, RDF and statistical language this isn’t the post you were looking for. 😉

PS: Yes Bob, it is a great title!

How To Build Linked Data APIs…

Wednesday, October 15th, 2014

This is the second high signal-to-noise presentation I have seen this week! I am sure that streak won’t last but I will enjoy it as long as it does.

Resources for after you see the presentation: Hydra: Hypermedia-Driven Web APIs, JSON for Linking Data, and JSON-LD 1.0.

Near the end of the presentation, Marcus quotes Phil Archer, W3C Data Activity Lead:

[Image: Archer on Semantic Web]

Which is an odd statement considering that JSON-LD 1.0, Section 7 Data Model, reads in part:

JSON-LD is a serialization format for Linked Data based on JSON. It is therefore important to distinguish between the syntax, which is defined by JSON in [RFC4627], and the data model which is an extension of the RDF data model [RDF11-CONCEPTS]. The precise details of how JSON-LD relates to the RDF data model are given in section 9. Relationship to RDF.

And section 9. Relationship to RDF reads in part:

JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize Generalized RDF Datasets. The JSON-LD extensions to the RDF data model are:…

Is JSON-LD “…a concrete RDF syntax…” where you can ignore RDF?
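To make the "both an RDF document and a JSON document" claim concrete, here is a deliberately tiny sketch. It is not the JSON-LD processing algorithm. It only handles a flat document whose `@context` maps terms directly to IRIs, and the sample document is made up. The point is that the same bytes parse as ordinary JSON and, with the context applied, read as triples.

```python
import json
from typing import Any, Dict, List, Tuple

# A made-up JSON-LD document; any JSON parser can read it as plain JSON.
DOC: Dict[str, Any] = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage"
  },
  "@id": "http://example.org/people/alice",
  "name": "Alice",
  "homepage": "http://example.org/alice/"
}
""")

def to_triples(doc: Dict[str, Any]) -> List[Tuple[str, str, Any]]:
    """Toy expansion: map each term through @context to get a predicate IRI.

    Assumes a flat document with simple term-to-IRI mappings. Nested nodes,
    typed values, @graph and the rest of the spec are ignored.
    """
    context = doc.get("@context", {})
    subject = doc["@id"]
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # keywords are not properties
        predicate = context.get(key)
        if predicate is None:
            continue  # terms without a mapping are dropped
        triples.append((subject, predicate, value))
    return triples

print(DOC["name"])          # the JSON view
for t in to_triples(DOC):   # the RDF view
    print(t)
```

A developer who stops at `DOC["name"]` never meets RDF. The triples are there only for those who apply the context.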

Not that I was ever a fan of RDF but standards should be fish or fowl and not attempt to be something in between.

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

Tuesday, September 16th, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. 😉

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see on most mailing list discussions.

I first saw this in a tweet by John Horgan.

The Truth About Triplestores [Opaqueness]

Friday, August 22nd, 2014

The Truth About Triplestores

A vendor “truth” document from Ontotext. Not that being from a vendor is a bad thing, but you should always consider the source of a document when evaluating its claims.

Quite naturally I jumped to: “6. Data Integration & Identity Resolution: Identifying the same entity across disparate data sources.”

With so many different databases and systems existing inside any single organization, how do companies integrate all of their data? How do they recognize that an entity in one database is the same entity in a completely separate database?

Resolving identities across disparate sources can be tricky. First, they need to be identified and then linked.

To do this effectively, you need two things. Earlier, we mentioned that through the use of text analysis, the same entity spelled differently can be recognized. Once this happens, the references to entities need to be stored correctly in the triplestore. The triplestore needs to support predicates that can declare two different Universal Resource Indicators (URIs) as one in the same. By doing this, you can align the same real-world entity used in different data sources. The most standard and powerful predicate used to establish mappings between multiple URIs of a single object is owl:sameAs. In turn, this allows you to very easily merge information from multiple sources including linked open data or proprietary sources. The ability to recognize entities across multiple sources holds great promise helping to manage your data more effectively and pinpointing connections in your data that may be masked by slightly different entity references. Merging this information produces more accurate results, a clearer picture of how entities are related to one another and the ability to improve the speed with which your organization operates.

In case you are unfamiliar with owl:sameAs, here is an example from the OWL Web Ontology Language Reference:

<rdf:Description rdf:about="#William_Jefferson_Clinton">
  <owl:sameAs rdf:resource="#BillClinton"/>
</rdf:Description>

The owl:sameAs in this case is opaque because there is no way to express why an author thought #William_Jefferson_Clinton and #BillClinton were about the same subject. You could argue that any prostitute in Colombia would recognize that mapping, so let’s try a harder case.

<rdf:Description rdf:about="#United States of America">
  <owl:sameAs rdf:resource="#الولايات المتحدة الأمريكية"/>
</rdf:Description>

Less confident than you were about the first one?

The problem with owl:sameAs is its opaqueness. You don’t know why an author used owl:sameAs. You don’t know what property or properties they saw that caused them to use one of the various understandings of owl:sameAs.

Without knowing those properties, accepting any owl:sameAs mapping is buying a pig in a poke. Not a proposition that interests me. You?
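To make the complaint concrete, here is a minimal Python sketch of what a non-opaque mapping might carry: the two identifiers plus the properties the author relied upon. Everything in it (the `Mapping` class, the `basis` field, the `accept` function and the sample property names) is hypothetical and is not part of OWL or of any triplestore.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    """A sameness claim that carries its own justification (hypothetical)."""
    left: str
    right: str
    # Property/value pairs the author says both identifiers share.
    basis: dict = field(default_factory=dict)

    def is_opaque(self) -> bool:
        # A bare owl:sameAs corresponds to an empty basis.
        return not self.basis


def accept(mapping: Mapping, required_keys) -> bool:
    """Accept a mapping only if its basis names every property we care about."""
    return set(required_keys) <= set(mapping.basis)


bare = Mapping("#William_Jefferson_Clinton", "#BillClinton")
explained = Mapping(
    "#William_Jefferson_Clinton",
    "#BillClinton",
    basis={"office": "42nd President of the United States", "born": "1946-08-19"},
)
```

With something like this, a reader who only trusts mappings grounded in a birth date can say so: `accept(explained, ["born"])` succeeds while `accept(bare, ["born"])` does not.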

I first saw this in a tweet by graphityhq.

Exposing Resources in Datomic…

Wednesday, August 20th, 2014

Exposing Resources in Datomic Using Linked Data by Ratan Sebastian.

From the post:

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data, and providing access to it for our application is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems

  1. Adding a new data set to make data visualizations with. This one was a high-dimensional data set and we were certain that the queries that would be needed to make the charts had to be very parameterizable.

  2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts given that we were serving up the data using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog or have to set up a Scala console to look at the raw data that was being served up.

  3. Different data sets that we use had semantically equivalent data that was being accessed in ways specific to that data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are very orthogonal goals to be sure. We embarked on a project which we thought might move us in those three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol. And what’s more accessible than REST. Goal 1 meant that we’d have expose quite a bit of Datalog expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform. A W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

If you are happy with Datomic and RDF primitives, for semantic purposes, this may be all you need.

You have to appreciate Ratan’s closing sentiments:

We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closeted world of handling financial data.

Even though we know as a practical matter that no “shared ontology of financial data” is likely to emerge.

In the absence of such a shared ontology, there are always topic maps.

DBpedia – Wikipedia Data Extraction

Friday, August 1st, 2014

DBpedia – Wikipedia Data Extraction by Gaurav Vaidya.

From the post:

We are happy to announce an experimental RDF dump of the Wikimedia Commons. A complete first draft is now available online at http://nl.dbpedia.org/downloads/commonswiki/20140705/, and will be eventually accesible from http://commons.dbpedia.org. A small sample dataset, which may be easier to browse, is available on Github at https://github.com/gaurav/commons-extraction/tree/master/commonswiki/20140101

Just in case you are looking for some RDF data to experiment with this weekend!

Web Annotation Working Group Charter

Wednesday, July 23rd, 2014

Web Annotation Working Group Charter

From the webpage:

Annotating, which is the act of creating associations between distinct pieces of information, is a widespread activity online in many guises but currently lacks a structured approach. Web citizens make comments about online resources using either tools built into the hosting web site, external web services, or the functionality of an annotation client. Readers of ebooks make use the tools provided by reading systems to add and share their thoughts or highlight portions of texts. Comments about photos on Flickr, videos on YouTube, audio tracks on SoundCloud, people’s posts on Facebook, or mentions of resources on Twitter could all be considered to be annotations associated with the resource being discussed.

The possibility of annotation is essential for many application areas. For example, it is standard practice for students to mark up their printed textbooks when familiarizing themselves with new materials; the ability to do the same with electronic materials (e.g., books, journal articles, or infographics) is crucial for the advancement of e-learning. Submissions of manuscripts for publication by trade publishers or scientific journals undergo review cycles involving authors and editors or peer reviewers; although the end result of this publishing process usually involves Web formats (HTML, XML, etc.), the lack of proper annotation facilities for the Web platform makes this process unnecessarily complex and time consuming. Communities developing specifications jointly, and published, eventually, on the Web, need to annotate the documents they produce to improve the efficiency of their communication.

There is a large number of closed and proprietary web-based “sticky note” and annotation systems offering annotation facilities on the Web or as part of ebook reading systems. A common complaint about these is that the user-created annotations cannot be shared, reused in another environment, archived, and so on, due to a proprietary nature of the environments where they were created. Security and privacy are also issues where annotation systems should meet user expectations.

Additionally, there are the related topics of comments and footnotes, which do not yet have standardized solutions, and which might benefit from some of the groundwork on annotations.

The goal of this Working Group is to provide an open approach for annotation, making it possible for browsers, reading systems, JavaScript libraries, and other tools, to develop an annotation ecosystem where users have access to their annotations from various environments, can share those annotations, can archive them, and use them how they wish.

Depending on how fine grained you want your semantics, annotation is one way to convey them to others.

Unfortunately, looking at the starting point for this working group, “open” means RDF, OWL and other non-commercially adopted technologies from the W3C.

Defining the ability to point, using XQuery perhaps, and reserving to users the ability to create standards for annotation payloads, would be a much more “open” approach. That is an approach you are unlikely to see from the W3C.

I would be more than happy to be proven wrong on that point.

RDFUnit

Tuesday, July 15th, 2014

RDFUnit – an RDF Unit-Testing suite

From the post:

RDFUnit is a test driven data-debugging framework that can run automatically generated (based on a schema) and manually generated test cases against an endpoint. All test cases are executed as SPARQL queries using a pattern-based transformation approach.

For more information on our methodology please refer to our report:

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web.

RDFUnit in a Nutshell

  • Test case: a data constraint that involves one or more triples. We use SPARQL as a test definition language.
  • Test suite: a set of test cases for testing a dataset
  • Status: Success, Fail, Timeout (complexity) or Error (e.g. network). A Fail can be an actual error, a warning or a notice
  • Data Quality Test Pattern (DQTP): Abstract test cases that can be intantiated into concrete test cases using pattern bindings
  • Pattern Bindings: valid replacements for a DQTP variable
  • Test Auto Generators (TAGs): Converts RDFS/OWL axioms into concrete test cases
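The relationship between a pattern and its bindings can be sketched in a few lines of Python. This is only an illustration of the idea, not RDFUnit’s implementation: the `%%NAME%%` placeholder syntax, the `instantiate` function and the sample pattern are all my assumptions.

```python
# An abstract test pattern: finds resources where property P1 has a
# larger value than property P2 (placeholder syntax is assumed).
PATTERN = (
    "SELECT ?s WHERE { "
    "?s %%P1%% ?v1 . "
    "?s %%P2%% ?v2 . "
    "FILTER (?v1 > ?v2) }"
)


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace each placeholder with its binding to get a concrete query."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace(f"%%{name}%%", value)
    if "%%" in query:
        # A leftover placeholder means the bindings were incomplete.
        raise ValueError("unbound pattern variable remains")
    return query


# A concrete test case: nobody should be born after they died.
query = instantiate(PATTERN, {"P1": "dbo:birthDate", "P2": "dbo:deathDate"})
```

Any row returned by the resulting SPARQL query would count as a violation of the constraint.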

If you are working with RDF data, this will certainly be helpful.

BTW, don’t miss the publications further down on the homepage for RDFUnit.

I first saw this in a tweet by Marin Dimitrov.

Ontology-Based Interpretation of Natural Language

Thursday, July 10th, 2014

Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger, John McCrae.

Authors’ description:

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge.

The book Ontology-based interpretation of natural language presents an approach to the interpretation of natural language with respect to specific domain knowledge captured in ontologies. It puts ontologies at the center of the interpretation process, meaning that ontologies not only provide a formalization of domain knowlegde necessary for interpretation but also support and guide the construction of meaning representations.

The links under Resources for Ontologies, Lexica and Grammars, as of today return “coming soon.”

Implementations fares a bit better, returning information on various aspects of lemon.

lemon is a proposed meta-model for describing ontology lexica with RDF. It is declarative, thus abstracts from specific syntactic and semantic theories, and clearly separates lexicon and ontology. It follows the principle of semantics by reference, which means that the meaning of lexical entries is specified by pointing to elements in the ontology.

lemon-core

It may just be me but the Lemon model seems more complicated than asking users what identifies their subjects and distinguishes them from other subjects.

Lemon is said to be compatible with RDF, OWL, SPARQL, etc.

But, accurate (to a user) identification of subjects and their relationships to other subjects is more important to me than compatibility with RDF, SPARQL, etc.

You?

I first saw this in a tweet by Stefano Bertolo.

Revision of Serializing RDF Data…

Saturday, June 21st, 2014

Revision of Serializing RDF Data as Clojure Code Specification by Frédérick Giasson.

From the post:

In my previous blog post RDF Code: Serializing RDF Data as Clojure Code I did outline a first version of what a RDF serialization could look like if it would be serialized using Clojure code. However, after working with this proposal for two weeks, I found a few issues with the initial assumptions that I made that turned out to be bad design decisions in terms of Clojure code.

This blog post will discuss these issues, and I will update the initial set of rules that I defined in my previous blog post. Going forward, I will use the current rules as the way to serialize RDF data as Clojure code.

An example of where heavy data use with a proposal leads to its refinement!

Looking forward to more posts in this series.

Archive integration at Mattilsynet

Saturday, June 21st, 2014

Archive integration at Mattilsynet by Lars Marius Garshol (slides)

In addition to being on the path to becoming a prominent beer expert (see: http://www.garshol.priv.no/blog/), Lars Marius has long been involved in integration technologies in general and topic maps in particular.

These slides give a quick overview of a current integration project.

There is one point Lars makes that merits special attention:

No hard bindings from code to data model

  • code should have no knowledge of the data model
  • all data model-specific logic should be configuration
  • makes data changes much easier to handle

(slide 4)
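A rough Python sketch of what that principle might look like in practice follows. The archive names, field names and the `convert` function are hypothetical and are not taken from the slides; the point is only that the code never mentions a field by name.

```python
# Hypothetical configuration: for each source, source field -> target field.
CONFIG = {
    "archive_a": {"doc_title": "title", "created": "date"},
    "archive_b": {"name": "title", "timestamp": "date"},
}


def convert(record: dict, source: str, config: dict = CONFIG) -> dict:
    """Rename fields according to configuration; no field names in code."""
    mapping = config[source]
    return {
        target: record[src]
        for src, target in mapping.items()
        if src in record
    }


a = convert({"doc_title": "Inspection 17", "created": "2014-06-01"}, "archive_a")
b = convert({"name": "Report", "timestamp": "2014-06-21"}, "archive_b")
```

When a source renames a field, only `CONFIG` changes. The function stays as it is.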

Keep that in mind when evaluating ETL solutions. What is being hard coded?

PS: I was amused that Lars describes RDF as “Essentially a graph database….” True but the W3C started marketing that claim only after graph databases had a surge in popularity.

Markup editors are manipulating directed acyclic graphs so I suppose they are graph editors as well. 😉

Foundations of an Alternative Approach to Reification in RDF

Tuesday, June 17th, 2014

Foundations of an Alternative Approach to Reification in RDF by Olaf Hartig and Bryan Thompson.

Abstract:

This document defines extensions of the RDF data model and of the SPARQL query language that capture an alternative approach to represent statement-level metadata. While this alternative approach is backwards compatible with RDF reification as defined by the RDF standard, the approach aims to address usability and data management shortcomings of RDF reification. One of the great advantages of the proposed approach is that it clarifies a means to (i) understand sparse matrices, the property graph model, hypergraphs, and other data structures with an emphasis on link attributes, (ii) map such data onto RDF, and (iii) query such data using SPARQL. Further, the proposal greatly expands both the freedom that database designers enjoy when creating physical indexing schemes and query plans for graph data annotated with link attributes and the interoperability of those database solutions.

The essence of the approach is to embed triples “in” triples that make statements about the embedded triples.

Works more efficiently than the standard RDF alternative but that’s hardly surprising.

Of course, you remain bound to lexical “sameness” as the identification for the embedded triple but I suppose fixing that would not be backwards compatible with the RDF standard.
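A toy Python sketch of the idea, using nested tuples rather than anything from the paper (the sample identifiers and the `statements_about` function are my own inventions):

```python
# An ordinary triple.
embedded = (":alice", ":knows", ":bob")

# A triple whose subject is the embedded triple itself.
annotation = (embedded, ":since", "2014")

store = {embedded, annotation}


def statements_about(triple, store):
    """Return every statement whose subject is the given triple."""
    return [t for t in store if t[0] == triple]


found = statements_about((":alice", ":knows", ":bob"), store)
missed = statements_about((":Alice", ":knows", ":bob"), store)
```

Note that the lookup succeeds only when the embedded triple matches term for term, which is the lexical “sameness” I mentioned: `found` holds the annotation while `missed` is empty because of one capital letter.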

I recommend this if you are working with RDF data. No point in it being any more inefficient than absolutely necessary.

PS: Reification is one of those terms that should be stricken from the CS vocabulary.

The question is: Can you make a statement about X? If the answer is no, there is no “reification” of X. Your information system cannot speak of X, which includes assigning any properties to X.

If the answer is yes, then the question is how do you identify X? Olaf and Bryan answer by saying “put a copy of X right here.” That’s one solution.

I first saw this in a tweet by Marin Dimitrov.

JSON-LD for software discovery…

Monday, June 16th, 2014

JSON-LD for software discovery, reuse and credit by Arfon Smith.

From the post:

JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

{ "name" : "Arfon" }

when there’s an entity called name you know that it means the name of a person and not a place.

If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API provides (and consumers) to share data more easily with little or no ambiguity about what the data they’re describing.

The YouTube video “What is JSON-LD?” by Manu Sporny makes an interesting point about the “ambiguity problem,” that is do you mean by “name” what I mean by “name” as a property?

At about time mark 5:36, Manu addresses the “ambiguity problem.”

The resolution of the ambiguity is to use a hyperlink as an identifier, the implication being that if we use the same identifier, we are talking about the same thing. (That isn’t true in real life, cf. the many meanings of owl:sameAs, but for simplicity’s sake, let’s leave that to one side.)

OK, what is the difference in both of us using the string “name” and both of us using the string “http://ex.com/name”? Both of them are opaque strings that either match or don’t. This just kicks the semantic can a little bit further down the road.

Let me use a better example from json-ld.org:

{
"@context": "http://json-ld.org/contexts/person.jsonld",
"@id": "http://dbpedia.org/resource/John_Lennon",
"name": "John Lennon",
"born": "1940-10-09",
"spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}

If you follow http://json-ld.org/contexts/person.jsonld you will obtain a 2.4k JSON-LD file that contains (in part):

"Person": "http://xmlns.com/foaf/0.1/Person"

Following that link results in a webpage that reads in part:

The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.

and it is said to be:

Disjoint With: Project Organization

Ambiguity jumps back to the fore with: Something is a Person if it is a person.

What is that? Solipsism? Tautology?

There is no opportunity to say what properties are necessary to qualify as a “person” in the sense defined by FOAF.

You may think that is nit-picking but without the ability to designate properties required to be a “person,” it isn’t possible to talk about 42 U.S.C. § 1983 civil rights actions where municipalities are held to be “persons” within the meaning of this law. That’s just one example. There are numerous variations on “person” for legal purposes.

You could argue that JSON-LD is for superficial or bubble-gum semantics but it is too useful a syntax for that fate.

Rather, I would like to see JSON-LD make ambiguity “manageable” by its users. True, you could define a “you know what I mean” document like FOAF, if that suits your purposes. On the other hand, you should be able to define required key/value pairs for any subject and for any key or value to extend an existing definition.
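What I have in mind might look something like the following Python sketch. None of this is JSON-LD or FOAF: the definition format, the `extend` and `qualifies` functions and the sample “kind” values are hypothetical, chosen only to show required key/value pairs and the extension of an existing definition.

```python
def extend(base: dict, extra: dict) -> dict:
    """Build a new definition from an existing one plus extra requirements."""
    merged = dict(base)
    merged.update(extra)
    return merged


def qualifies(record: dict, definition: dict) -> bool:
    """True when the record has every required key with an allowed value.

    A requirement of None means any value is acceptable for that key.
    """
    for key, allowed in definition.items():
        if key not in record:
            return False
        if allowed is not None and record[key] not in allowed:
            return False
    return True


# A "you know what I mean" definition: a name is all it takes.
FOAF_LIKE_PERSON = {"name": None}

# A stricter, legally flavored definition that extends the first one.
SECTION_1983_PERSON = extend(
    FOAF_LIKE_PERSON, {"kind": {"natural person", "municipality"}}
)

city = {"name": "City of Springfield", "kind": "municipality"}
singer = {"name": "John Lennon"}
```

Under the loose definition both records are “persons.” Under the stricter one only `city` is, because `singer` never says what kind of person it describes.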

How far you need to go is on a case by case basis. For apps that display “AI” by tracking you and pushing more ads your way, FOAF may well be sufficient. For those of us with non-advertising driven interests, other diversions may await.

Emacs Settings for Clojure

Thursday, May 29th, 2014

My Optimal GNU Emacs Settings for Developing Clojure (so far) by Frédérick Giasson.

From the post:

In the coming months, I will start to publish a series of blog posts that will explain how RDF data can be serialized in Clojure code and more importantly what are the benefits of doing this. At Structured Dynamics, we started to invest resources into this research project and we believe that it will become a game changer regarding how people will consume, use and produce RDF data.

But I want to take a humble first step into this journey just by explaining how I ended up configuring Emacs for working with Clojure. I want to take the time to do this since this is a trials and errors process, and that it may be somewhat time-consuming for the new comers.

In an interesting twist for an article on Emacs, Frédérick recommends strongly that the reader consider Light Table as an IDE for Clojure over Emacs, especially if they are not already Emacs users.

What follows is a detailed description of changes for your .emacs file should you want to follow the Emacs route, including a LightTable theme for Emacs.

A very useful post and I am looking forward to the Clojure/RDF posts to follow.

Workload Matters: Why RDF Databases Need a New Design

Saturday, May 17th, 2014

Workload Matters: Why RDF Databases Need a New Design by Güneş Aluç, M. Tamer Özsu, and Khuzaima Daudjee.

Abstract:

The Resource Description Framework (RDF) is a standard for conceptually describing data on the Web, and SPARQL is the query language for RDF. As RDF is becoming widely utilized, RDF data management systems are being exposed to more diverse and dynamic workloads. Existing systems are workload-oblivious, and are therefore unable to provide consistently good performance. We propose a vision for a workload-aware and adaptive system. To realize this vision, we re-evaluate relevant existing physical design criteria for RDF and address the resulting set of new challenges.

The authors establish that RDF data management systems are in need of better processing models. However, they mention a “prototype” only in their conclusion and offer no evidence concerning their possible alternatives for RDF processing.

I don’t doubt the need for better RDF processing but I would think the first step would be to determine the goals of RDF processing, separate and apart from the RDF model.

Simply because we conceptualize data as being encoded in “triples,” does not mean that computers must process them as “triples.” They can if it is advantageous but not if there are better processing models.
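As a small illustration (mine, not the authors’), the same triples can be regrouped into one record per subject, which may suit subject-centered lookups better than scanning a list of triples. The `group_by_subject` function and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Regroup (subject, predicate, object) triples into per-subject records."""
    records = defaultdict(dict)
    for s, p, o in triples:
        # A predicate may repeat, so each one maps to a list of objects.
        records[s].setdefault(p, []).append(o)
    return dict(records)


triples = [
    (":a", ":p", "1"),
    (":a", ":p", "2"),
    (":b", ":q", "3"),
]
records = group_by_subject(triples)
```

The information is unchanged; only the physical layout differs. Whether that layout is the better one depends on the workload.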

I first saw this in a tweet by Olaf Hartig.

WordNet RDF

Wednesday, April 16th, 2014

WordNet RDF

From the webpage:

WordNet is supported by the National Science Foundation under Grant Number 0855157. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the creators of WordNet and do not necessarily reflect the views of the National Science Foundation.

This is the RDF version of WordNet, created by mapping the existing WordNet data into RDF. The data is structured according to the lemon model. In addition, links have been added from the following sources:

These links increase the usefulness of the WordNet data. If you would like to contribute extra linking to WordNet please Contact us.

Curious if you find it easier to integrate WordNet RDF with other data or the more traditional WordNet?

I first saw this in a tweet by Bob DuCharme.

Bersys 2014!

Thursday, March 6th, 2014

Bersys 2014!

From the webpage:

Following the 1st International workshop on Benchmarking RDF Systems (BeRSys 2013) the aim of the BeRSys 2014 workshop is to provide a discussion forum where researchers and industrials can meet to discuss topics related to the performance of RDF systems. BeRSys 2014 is the only workshop dedicated to benchmarking different aspects of RDF engines – in the line of TPCTC series of workshops.The focus of the workshop is to expose and initiate discussions on best practices, different application needs and scenarios related to different aspects of RDF data management.

We will solicit contributions presenting experiences with benchmarking RDF systems, real-life RDF application needs which are good candidates for benchmarking, as well as novel ideas on developing benchmarks for different aspects of RDF data management ranging from query processing, reasoning to data integration. More specifically, we will welcome contributions from a diverse set of domain areas such as life science (bio-informatics, pharmaceutical), social networks, cultural informatics, news, digital forensics, e-science (astronomy, geology) and geographical among others. More specifically, the topics of interest include but are not limited to:

  • Descriptions of RDF data management use cases and query workloads
  • Benchmarks for RDF SPARQL 1.0 and SPARQL 1.1 query workloads
  • Benchmarks RDF data integration tasks including but not limited to ontology aligment, instance matching and ETL techniques
  • Benchmark metrics
  • Temporal and geospatial benchmarks
  • Evaluation of benchmark performance results on RDF engines
  • Benchmark principles
  • Query processing and optimization algorithms for RDF systems.

Venue:

The workshop is held in conjuction with the 40th International Conference on Very Large Data Bases (VLDB2014) in Hangzhou, China.

The only date listed on the announcement is September 1-5, 2014 for the workshop.

When other dates appear, I will update this post and re-post about the conference.

As you have seen in better papers on graphs, RDF, etc., benchmarking in this area is a perilous affair. Workshops, like this one, are one step towards building the experience necessary to consider the topic of benchmarking.

I first saw this in a tweet by Stefano Bertolo.

Biodiversity Information Standards

Tuesday, March 4th, 2014

Biodiversity Information Standards

From the webpage:

The most widely deployed formats for biodiversity occurrence data are Darwin Core (wiki) and ABCD (wiki).

The TDWG community’s priority is the deployment of Life Science Identifiers (LSID), the preferred Globally Unique Identifier technology and transitioning to RDF encoded metadata as defined by a set of simple vocabularies. All new projects should address the need for tagging their data with LSIDs and consider the use or development of appropriate vocabularies.

TDWG’s activities within the biodiversity informatics domain can be found in the Activities section of this website.

TDWG = Taxonomic Databases Working Group.

I originally followed a link on “Darwin Core,” which sounded too much like another “D***** Core” to not check the reference.

Net result is two of the most popular formats used for biodiversity data.