Archive for the ‘Provenance’ Category


Thursday, March 20th, 2014


From the webpage:

PLUS is a system for capturing and managing provenance information, originally created at the MITRE Corporation.

Data provenance is “information that helps determine the derivation history of a data product…[It includes] the ancestral data product(s) from which this data product evolved, and the process of transformation of these ancestral data product(s).”

Uses Neo4j for storage.

Includes an academic bibliography of related papers.

Provenance answers the questions: Where has your data been, what has happened to it, and with whom?
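The quoted definition (ancestral data products plus the transformations between them) can be pictured as a small graph walk. What follows is a minimal sketch only: it is not the PLUS API and does not use Neo4j, and the file names, the `DERIVED_FROM` table and the `derivation_history` function are hypothetical names chosen for illustration.

```python
from collections import deque

# Hypothetical derivation records: each data product maps to the
# (ancestor, transformation) pairs it was directly derived from.
DERIVED_FROM = {
    "report.pdf": [("summary.csv", "render")],
    "summary.csv": [("raw_a.csv", "aggregate"), ("raw_b.csv", "aggregate")],
    "raw_a.csv": [],
    "raw_b.csv": [],
}


def derivation_history(product, derived_from):
    """Return every (descendant, transformation, ancestor) step behind
    `product`, walking the graph breadth-first."""
    steps = []
    seen = {product}
    queue = deque([product])
    while queue:
        current = queue.popleft()
        for ancestor, transformation in derived_from.get(current, []):
            steps.append((current, transformation, ancestor))
            if ancestor not in seen:  # visit each ancestor only once
                seen.add(ancestor)
                queue.append(ancestor)
    return steps


for descendant, transformation, ancestor in derivation_history("report.pdf", DERIVED_FROM):
    print(f"{descendant} <- {transformation} <- {ancestor}")
```

A graph database is a natural fit for this shape of data, since "what did this come from?" becomes a path query rather than a chain of joins.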

Provenance Reconstruction Challenge 2014

Thursday, January 23rd, 2014

Provenance Reconstruction Challenge 2014


  • February 17, 2014 Test Data released
  • May 18, 2014 Last day to register for participation
  • May 19, 2014 Challenge Data released
  • June 13, 2014 Provenance Reconstruction Challenge Event at Provenance Week – Cologne Germany

From the post:

While the use of version control systems, workflow engines, provenance aware filesystems and databases, is growing there is still a plethora of data that lacks associated data provenance. To help solve this problem, a number of research groups have been looking at reconstructing the provenance of data using the computational environment in which it resides. This research however is still very new in the community. Thus, the aim of the Provenance Reconstruction Challenge is to help spur research into the reconstruction of provenance by providing a common task and datasets for experimentation.

The Challenge

Challenge participants will receive an open data set and corresponding provenance graphs (in W3C PROV format). They will then have several months to work with the data trying to reconstruct the provenance graphs from the open data set. 3 weeks before the challenge face-2-face event the participants will receive a new data set and a gold standard provenance graph. Participants are asked to register before the challenge dataset is released and to prepare a short description of their system to be placed online after the event.
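The post does not say how a reconstructed graph is compared with the gold standard. One plausible approach, offered here purely as an assumption, is to treat each graph as a set of edges and report precision and recall. The `edge_scores` function and all of the example edges below are hypothetical.

```python
def edge_scores(reconstructed, gold):
    """Compare two provenance graphs given as collections of
    (subject, relation, object) edges; return (precision, recall)."""
    reconstructed, gold = set(reconstructed), set(gold)
    matched = reconstructed & gold
    precision = len(matched) / len(reconstructed) if reconstructed else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall


# Made-up gold standard and a made-up reconstruction attempt.
gold = {
    ("draft2.txt", "wasDerivedFrom", "draft1.txt"),
    ("final.txt", "wasDerivedFrom", "draft2.txt"),
    ("final.txt", "wasAttributedTo", "alice"),
}
reconstructed = {
    ("draft2.txt", "wasDerivedFrom", "draft1.txt"),
    ("final.txt", "wasDerivedFrom", "draft1.txt"),  # wrong ancestor
}

precision, recall = edge_scores(reconstructed, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact edge matching is strict: a reconstruction that skips an intermediate draft gets no credit for being nearly right, so a real evaluation might also want to score paths.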

The Event

At the event, we will have presentations of the results and the systems as well as a group conversation around the techniques used. The event will result in a joint report about techniques for reproducing provenance and paths forward.

For further information on the W3C PROV format:

Provenance Working Group

PROV at Semantic Web Wiki.

PROV Implementation Report (60 implementations as of 30 April 2013)

I first saw this in a tweet by Paul Groth.

Successful PROV Tutorial at EDBT

Friday, April 5th, 2013

Successful PROV Tutorial at EDBT by Paul Groth.

From the post:

On March 20th, 2013 members of the Provenance Working Group gave a tutorial on the PROV family of specifications at the EDBT conference in Genova, Italy. EDBT (“Extending Database Technology”) is widely regarded as one of the prime venues in Europe for dissemination of data management research.

The 1.5-hour tutorial was attended by about 26 participants, mostly from academia. It was structured into three parts of approximately the same length. The first two parts introduced PROV as a relational data model with constraints and inference rules, supported by a (nearly) relational notation (PROV-N). The third part presented known extensions and applications of PROV, based on the extensive PROV implementation report and implementations known to the presenter at the time.

All the presentation material is available here.
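To give a feel for the "(nearly) relational" flavor of PROV-N mentioned in the post, here is a minimal sketch that holds provenance statements as (relation name, arguments) tuples and prints them in PROV-N style. The `ex:` prefix, the identifiers and the `to_prov_n` helper are assumptions for illustration, and the output has not been run through a PROV validator.

```python
# Hypothetical statements: each is a relation name plus positional arguments.
# "-" is the PROV-N marker for an optional argument left unspecified
# (here, the time of the usage or generation).
STATEMENTS = [
    ("entity", ["ex:rawData"]),
    ("entity", ["ex:chart"]),
    ("activity", ["ex:plotting"]),
    ("used", ["ex:plotting", "ex:rawData", "-"]),
    ("wasGeneratedBy", ["ex:chart", "ex:plotting", "-"]),
    ("wasDerivedFrom", ["ex:chart", "ex:rawData"]),
]


def to_prov_n(statements, prefix="ex", namespace="http://example.org/"):
    """Serialize (name, args) tuples as a PROV-N style document string."""
    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  {name}({', '.join(args)})" for name, args in statements]
    lines.append("endDocument")
    return "\n".join(lines)


print(to_prov_n(STATEMENTS))
```

Read that way, each statement is a row in a relation named `used`, `wasGeneratedBy` and so on, which is what makes constraints and inference rules over PROV easy to state.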

As the first part of the tutorial notes:

  • Provenance is not a new subject
    • workflow systems
    • databases
    • knowledge representation
    • information retrieval
  • Existing community-grown vocabularies
    • Open Provenance Model (OPM)
    • Dublin Core
    • Provenir ontology
    • Provenance vocabulary
    • SWAN provenance ontology
    • etc.

The existence of “other” vocabularies isn’t an issue for topic maps.

You can query on “your” vocabulary and obtain results from “other” vocabularies.

Enriches your information and that of others.

You will need to know about the vocabularies of others and their oddities.

For the W3C work on provenance, follow this tutorial and the others it mentions.

Dublin Core Mapping Comments [by 7 April 2013]

Monday, March 18th, 2013

Stuart Sutton, Managing Director, DCMI, calls on the Dublin Core community to comment on a mapping from Dublin Core terms to the PROV provenance ontology.

His call reads:

The DCMI Metadata Provenance Task Group [1] is collaborating with the W3C Provenance Working Group [2] on a mapping from Dublin Core terms to the PROV provenance ontology [3], currently a W3C Proposed Recommendation. More precisely, the document describes a partial mapping from DCMI Metadata Terms [4] to the PROV-O OWL2 ontology [5] — a set of classes and properties usable for representing and interchanging information about provenance. Numerous terms in the DCMI vocabulary provide information about the provenance of a resource. Translating these terms into PROV relates this information explicitly to the W3C provenance model.

The mapping is currently a W3C Working Draft. The final state of the document will be that of a W3C Note, to be published as part of a suite of documents in support of a W3C Recommendation for provenance interchange [6].

DCMI would like to point to the W3C Note as a DCMI Recommended Resource and therefore encourages the Dublin Core community to provide feedback and take part in the finalization of the mapping.

The deadline for all comments is 7 April 2013. We recommend that comments be provided directly to the public W3C list for comments: [7], ideally with a Cc: to DCMI’s dc-provenance list [8]. Comments sent only to the dc-provenance list will be summarized on the W3C list and addressed, and discussions on the W3C list will be summarized back on the dc-provenance list when appropriate.

Stuart Sutton, Managing Director, DCMI


Why Data Lineage is Your Secret … Weapon [Auditing Topic Maps]

Sunday, March 10th, 2013

Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.

From the post:

Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?

It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.

I’m constantly amazed by why so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.

But I digress. Why is data lineage your secret data quality weapon?

The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:

  • Lack of ownership
  • Lack of formal information chain design

Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?

And if you can’t trace, how can you audit the merging of topics?

And if you can’t audit, how do you determine the reliability of your topic map?

That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.

Same questions apply to all data aggregation systems.
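A minimal sketch of what a traceable merge could look like, assuming each topic simply keeps references to the topics it was merged from. This is not any Topic Maps API; the `Topic` class, its fields and the sample catalogs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic plus the audit trail of every source topic merged into it."""
    topic_id: str
    names: set
    source: str
    retrieved: str  # ISO date the source was read, e.g. "2013-03-09"
    merged_from: list = field(default_factory=list)


def merge(a, b, merged_id):
    """Merge two topics, keeping both originals in the audit trail."""
    return Topic(
        topic_id=merged_id,
        names=a.names | b.names,
        source="merged",
        retrieved=max(a.retrieved, b.retrieved),  # ISO dates sort as strings
        merged_from=[a, b],
    )


def audit(topic, depth=0):
    """Yield (depth, topic_id, source, retrieved) for the topic and,
    recursively, for everything that was merged into it."""
    yield depth, topic.topic_id, topic.source, topic.retrieved
    for parent in topic.merged_from:
        yield from audit(parent, depth + 1)


a = Topic("t1", {"Ginny Weasley"}, "catalog-A", "2013-03-01")
b = Topic("t2", {"Ginevra Weasley"}, "catalog-B", "2013-03-09")
merged = merge(a, b, "t3")

for depth, topic_id, source, retrieved in audit(merged):
    print("  " * depth + f"{topic_id} from {source} ({retrieved})")
```

With the trail in place, freshness and source reliability become things you can check per contributing topic rather than guess at for the merged result.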

Or as Mrs. Weasley tells Ginny:

“Never trust anything that can think for itself if you can’t see where it keeps its brain.”

Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter. 😉

At or Near Final Calls on W3C Provenance

Wednesday, October 3rd, 2012

I saw a notice today about the ontology part of the W3C work on provenance. Some of it is at final call or nearly so. If you are interested, see:

  • PROV-DM, the PROV data model for provenance;
  • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model;
  • PROV-N, a notation for provenance aimed at human consumption;
  • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of PROV to RDF;
  • PROV-AQ, the mechanisms for accessing and querying provenance;
  • PROV-PRIMER, a primer for the PROV data model.

My first impression is the provenance work is more complex than HTML 3.2 and therefore unlikely to see widespread adoption. (You may want to bookmark that link. It isn’t listed on the HTML page at the W3C, even under obsolete versions.)

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms

Thursday, July 26th, 2012

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms by Zhang, Qing Olive; Ko, Ryan K L; Kirchberg, Markus; Suen, Chun-Hui; Jagadpramana, Peter; Lee, Bu Sung.


As cloud computing and virtualization technologies become mainstream, the need to be able to track data has grown in importance. Having the ability to track data from its creation to its current state or its end state will enable the full transparency and accountability in cloud computing environments. In this paper, we showcase a novel technique for tracking end-to-end data provenance, a meta-data describing the derivation history of data. This breakthrough is crucial as it enhances trust and security for complex computer systems and communication networks. By analyzing and utilizing provenance, it is possible to detect various data leakage threats and alert data administrators and owners; thereby addressing the increasing needs of trust and security for customers’ data. We also present our rule-based data provenance tracing algorithms, which trace data provenance to detect actual operations that have been performed on files, especially those under the threat of leaking customers’ data. We implemented the cloud data provenance algorithms into an existing software with a rule correlation engine, show the performance of the algorithms in detecting various data leakage threats, and discuss technically its capabilities and limitations.
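The abstract does not spell out the rules themselves, so the following is only a guess at the general shape of rule-based tracing over a file-operation log, not the authors' algorithm. The log format, the `SENSITIVE` set and the single read-then-write-then-send rule in `trace_leaks` are all assumptions.

```python
# Hypothetical file-operation log: (timestamp, user, operation, path).
EVENTS = [
    (1, "alice", "read", "/data/customers.db"),
    (2, "alice", "write", "/tmp/export.csv"),
    (3, "alice", "send", "/tmp/export.csv"),
    (4, "bob", "read", "/data/readme.txt"),
]

# Hypothetical set of files holding customer data.
SENSITIVE = {"/data/customers.db"}


def trace_leaks(events, sensitive):
    """Flag 'send' operations on sensitive files or on files written by a
    user who had previously read sensitive (or derived) data."""
    tainted_users = set()           # users who have read tainted data
    tainted_files = set(sensitive)  # sensitive files plus files derived from them
    alerts = []
    for timestamp, user, operation, path in sorted(events):
        if operation == "read" and path in tainted_files:
            tainted_users.add(user)
        elif operation == "write" and user in tainted_users:
            tainted_files.add(path)  # conservatively treat the output as derived
        elif operation == "send" and path in tainted_files:
            alerts.append((timestamp, user, path))
    return alerts


for timestamp, user, path in trace_leaks(EVENTS, SENSITIVE):
    print(f"t={timestamp}: {user} sent {path}, which traces back to sensitive data")
```

The rule is deliberately conservative: anything a user writes after touching sensitive data counts as derived, which will over-report but will not miss a simple copy-and-send.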

Interesting work but data provenance isn’t solely a cloud computing, virtualization issue.

Consider the ongoing complaints in Washington, D.C. about who leaked what to whom, and why.

All posturing to one side, that is a data provenance and subject identity issue.

The sort of thing where a topic map application could excel.