Archive for the ‘Data Repositories’ Category

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories [But Who Pays?]

Sunday, September 13th, 2015

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories by Kirk Borne.

From the post:

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

The following seven V’s represent characteristics and challenges of open data:

  1. Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety: the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.
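The provenance elements in #7 can be captured in a structured record. Here is a minimal sketch of such a record in Python; the field names and values are illustrative only, not a standard provenance vocabulary:

```python
# A sketch of a provenance record for an open data set, covering the
# lineage elements listed above: ownership, origin, chain of custody,
# transformations (with software versions), and uses in context.
provenance = {
    "ownership": "Example Agency, Office of Data Services",
    "origin": "Field survey, 2014-06",
    "chain_of_custody": [
        "collected by survey team",
        "transferred to agency data office",
        "published to open data portal",
    ],
    "transformations": [
        {"step": "deduplication", "software": "pandas", "version": "0.16.2"},
        {"step": "unit conversion to SI", "software": "custom script", "version": "1.3"},
    ],
    "uses": [
        {"context": "annual environmental report", "year": 2015},
    ],
}

def lineage_summary(record):
    """Render the chain of custody and each processing step as one line each."""
    lines = ["custody: " + " -> ".join(record["chain_of_custody"])]
    for t in record["transformations"]:
        lines.append("processed: {step} ({software} {version})".format(**t))
    return lines
```

Even a simple record like this answers the questions provenance exists to answer: who held the data, what was done to it, and with which tools.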

Open Data has many benefits when the 7 V’s are answered!

Kirk doesn’t address who pays the cost of the 7 V’s being answered.

The most obvious one for topic maps:

#5 Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use….

Yes, “…when you provide the data for others to use.” If I can use my data without documenting the semantics and schema (data models), who covers the cost of my creating that documentation and schemas?

In any sufficiently large enterprise, when you ask for assistance, the response will ask for the contract number to which the assistance should be billed.

If you know your Heinlein, then you know the acronym TANSTAAFL (“There ain’t no such thing as a free lunch”) and its application here is obvious.

Or should I say its application is obvious from the repeated calls for better documentation and models and the continued absence of the same?

Who do you think should be paying for better documentation and data models?

Data as “First Class Citizens”

Tuesday, February 10th, 2015

Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title, but not all data are going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. Great essay on using multiple sources to annotate entities once disambiguation had occurred. But some entities are going to have more annotations than others and some entities may not be recognized at all. Not to mention that it is rather doubtful the markup containing those entities was itself annotated at all.

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As are the instructions for processing that data? Perhaps not in every case, but shouldn’t data and the formats that hold data be treated equally as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built but we have not one accurate report on the process.

Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.

Over 1,000 research data repositories indexed in re3data.org

Thursday, November 20th, 2014

Over 1,000 research data repositories indexed in re3data.org

From the post:

In August 2012 re3data.org – the Registry of Research Data Repositories – went online with 23 entries. Two years later the registry provides researchers, funding organisations, libraries and publishers with over 1,000 listed research data repositories from all over the world, making it the largest and most comprehensive online catalog of research data repositories on the web. re3data.org provides detailed information about the research data repositories, and its distinctive icons help researchers easily identify relevant repositories for accessing and depositing data sets.

To more than 5,000 unique visitors per month, re3data.org offers reliable orientation in the heterogeneous landscape of research data repositories. An average of 10 repositories are added to the registry every week. The latest indexed data infrastructure is the new CERN Open Data Portal.

Add re3data.org to your short list of major data repositories!

EcoData Retriever

Sunday, October 5th, 2014

EcoData Retriever

From the webpage:

Most ecological datasets do not adhere to any agreed-upon standards in format, data structure or method of access. As a result acquiring and utilizing available datasets can be a time consuming and error prone process. The EcoData Retriever automates the tasks of finding, downloading, and cleaning up ecological data files, and then stores them in a local database. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days. Small datasets can be downloaded and installed in seconds and large datasets in minutes. The program also cleans up known issues with the datasets and automatically restructures them into standard formats before inserting the data into your choice of database management systems (Microsoft Access, MySQL, PostgreSQL, and SQLite, on Windows, Mac and Linux).
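The download–clean–load pipeline the Retriever automates can be sketched in a few lines. This is not the EcoData Retriever’s API, just a toy illustration of the pattern it implements; the column names, sentinel values, and cleanup rules are hypothetical stand-ins for a real dataset’s known issues:

```python
# Sketch of the pattern a tool like the EcoData Retriever automates:
# parse a fetched dataset, fix its known quirks, and load the result
# into a local SQLite database.
import csv
import io
import sqlite3

def clean_row(row):
    """Known fixes for this (hypothetical) dataset: strip stray
    whitespace and map its sentinel values for 'missing' to NULL."""
    fixed = {k.strip().lower(): v.strip() for k, v in row.items()}
    for key, value in fixed.items():
        if value in ("", "-999", "NA"):
            fixed[key] = None
    return fixed

def load_csv(text, db_path=":memory:"):
    """Parse CSV text, clean each row, and insert into SQLite."""
    rows = [clean_row(r) for r in csv.DictReader(io.StringIO(text))]
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE obs (site TEXT, species TEXT, count INTEGER)")
    conn.executemany("INSERT INTO obs VALUES (:site, :species, :count)", rows)
    conn.commit()
    return conn

# A tiny sample with the usual problems: padding and a -999 sentinel.
sample = "site,species,count\nA1, Parus major ,4\nA2,Sitta europaea,-999\n"
conn = load_csv(sample)
```

The Retriever’s contribution is doing exactly this, per dataset, with the quirks already cataloged, so the user never has to rediscover them.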

When faced with:

…datasets [that] do not adhere to any agreed-upon standards in format, data structure or method of access

you can:

  • Complain to fellow cube dwellers
  • Complain about data producers
  • Complain to the data producers
  • Create a solution to clean up and reformat the data as open source

Your choice?

I first saw this in a tweet by Dan McGlinn.

Open source datacenter computing with Apache Mesos

Monday, September 15th, 2014

Open source datacenter computing with Apache Mesos by Sachin P. Bappalige.

From the post:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system, making it easier to deploy and manage applications in large-scale clustered environments. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel—”cgroups” in Linux, “zones” in Solaris—to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to make a large collection of heterogeneous machines look like a single pool of resources. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various web servers, databases and application servers.
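The two-level scheduling idea is easy to miss in prose: the master decides how much to offer each framework, and the framework decides how much of the offer to accept. Here is a toy sketch of that split (emphatically not the Mesos API; class names and the master’s offer policy are invented for illustration):

```python
# Toy model of Mesos-style two-level scheduling via resource offers.
# Level 1: the master chooses how many resources to offer a framework.
# Level 2: the framework chooses which (here, how many) to accept.
class Framework:
    def __init__(self, demand):
        self.demand = demand  # CPUs this framework still needs

    def consider(self, offered):
        """Second scheduling level: accept only what is needed."""
        taken = min(offered, self.demand)
        self.demand -= taken
        return taken

class Master:
    def __init__(self, cpus):
        self.free_cpus = cpus

    def offer(self, framework, cpus):
        """First scheduling level: offer up to `cpus` CPUs from the
        free pool; reclaim whatever the framework declines."""
        cpus = min(cpus, self.free_cpus)
        accepted = framework.consider(cpus)
        self.free_cpus -= accepted
        return accepted

master = Master(cpus=10)
hadoop, spark = Framework(demand=6), Framework(demand=8)
# One possible master policy: offer 5 CPUs to each framework in turn.
a = master.offer(hadoop, 5)
b = master.offer(spark, 5)
```

The point of the split is that the master never needs to understand any framework’s internal scheduling logic; it only hands out offers and accounts for what was accepted.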

This introduction to Apache Mesos will give you a quick overview of what Mesos has to offer without getting bogged down in details. Details will come later, whether you want to run a datacenter using Mesos or map a datacenter already being run with it.

Publishing biodiversity data directly from GitHub to GBIF

Saturday, March 15th, 2014

Publishing biodiversity data directly from GitHub to GBIF by Roderic D. M. Page.

From the post:

Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

In case you don’t know about GBIF (I didn’t):

The Global Biodiversity Information Facility (GBIF) is an international open data infrastructure, funded by governments.

It allows anyone, anywhere to access data about all types of life on Earth, shared across national boundaries via the Internet.

By encouraging and helping institutions to publish data according to common standards, GBIF enables research not possible before, and informs better decisions to conserve and sustainably use the biological resources of the planet.

GBIF operates through a network of nodes, coordinating the biodiversity information facilities of Participant countries and organizations, collaborating with each other and the Secretariat to share skills, experiences and technical capacity.

GBIF’s vision: “A world in which biodiversity information is freely and universally available for science, society and a sustainable future.”

Roderic summarizes his post saying:

what I’m doing here is putting data on GitHub and having GBIF harvest that data directly from GitHub. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.
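The “rebuild the Darwin Core Archive” step in that workflow is mechanical: a DwC-A is a zip file bundling the data file(s) with a meta.xml that describes the columns. Here is a minimal sketch of building one; the fields are a tiny illustrative subset, not a complete, GBIF-ready descriptor:

```python
# Sketch of packing occurrence data plus a meta.xml column descriptor
# into a Darwin Core Archive (a zip file), the unit GBIF harvests.
import io
import zipfile

META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

def build_archive(occurrences):
    """Pack tab-separated occurrence rows and meta.xml into DwC-A bytes."""
    lines = ["occurrenceID\tscientificName"]
    lines += ["{0}\t{1}".format(oid, name) for oid, name in occurrences]
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("occurrence.txt", "\n".join(lines) + "\n")
        zf.writestr("meta.xml", META_XML)
    return buf.getvalue()

archive = build_archive([("1", "Parus major"), ("2", "Sitta europaea")])
```

Because the archive is rebuilt from plain text files, it fits naturally into a git workflow: edit, rebuild, commit, push, and let GBIF reindex.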

The process isn’t perfect but unlike disciplines where data sharing is the exception rather than the rule, the biodiversity community is trying to improve its sharing of data.

Every attempt at improvement will not succeed but lessons are learned from every attempt.

Kudos to the biodiversity community for a model that other communities should follow!

…The Registry

Saturday, January 4th, 2014

Making Research Data Repositories Visible: The re3data.org Registry by Heinz Pampel, et al.


Researchers require infrastructures that ensure a maximum of accessibility, stability and reliability to facilitate working with and sharing of research data. Such infrastructures are being increasingly summarized under the term Research Data Repositories (RDR). The project re3data.org–Registry of Research Data Repositories–has begun to index research data repositories in 2012 and offers researchers, funding organizations, libraries and publishers an overview of the heterogeneous research data repository landscape. In July 2013 re3data.org lists 400 research data repositories and counting. 288 of these are described in detail using the re3data.org vocabulary. Information icons help researchers to easily identify an adequate repository for the storage and reuse of their data. This article describes the heterogeneous RDR landscape and presents a typology of institutional, disciplinary, multidisciplinary and project-specific RDR. Further the article outlines the features of re3data.org, and shows how this registry helps to identify appropriate repositories for storage and search of research data.

A great summary of progress so far but pay close attention to:

In the following, the term research data is defined as digital data being a (descriptive) part or the result of a research process. This process covers all stages of research, ranging from research data generation, which may be in an experiment in the sciences, an empirical study in the social sciences or observations of cultural phenomena, to the publication of research results. Digital research data occur in different data types, levels of aggregation and data formats, informed by the research disciplines and their methods. With regards to the purpose of access for use and re-use of research data, digital research data are of no value without their metadata and proper documentation describing their context and the tools used to create, store, adapt, and analyze them [7]. (emphasis added)

If you think about that for a moment, you will realize that research data should include all the “metadata and proper documentation… and the tools….” The need for explanation does not go away because of the label “metadata” or “documentation.”

Not that we can ever avoid semantic opaqueness, but depending on the value of the data, we can push it further away in some cases than in others.

An article that will repay a close reading.

I first saw this in a tweet by Stuart Buck.

Data Repositories…

Thursday, November 14th, 2013

Data Repositories – Mother’s Milk for Data Scientists by Jerry A. Smith.

From the post:

Mothers are life givers, giving the milk of life. While there are so very few analogies so apropos, data is often considered the Mother’s Milk of Corporate Valuation. So, as a data scientist, we should treat dearly all those sources of data, understanding their place in the overall value chain of corporate existence.

A Data Repository is a logical (and sometimes physical) partitioning of data where multiple databases which apply to specific applications or sets of applications reside. For example, several databases (revenues, expenses) which support financial applications (A/R, A/P) could reside in a single financial Data Repository. Data Repositories can be found both internal (e.g., in data warehouses) and external (see below) to an organization. Here are a few repositories from KDnuggets that are worth taking a look at: (emphasis in original)

I count sixty-four (64) collections of data sets as of today.

What I haven’t seen, perhaps you have, is an index across the most popular data set collections that dedupes data sets and has thumbnail information for each one.
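Such an index is not hard to sketch: merge listings from several collections, dedupe on a normalized title, and keep a thumbnail record of where each data set appears. The collection names, fields, and the crude dedupe key below are made up for illustration; a real index would need far better identity resolution:

```python
# Sketch of a cross-collection data set index with deduplication.
import re

def normalize(title):
    """Crude dedupe key: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def build_index(collections):
    """collections: {collection_name: [(title, url), ...]}.
    Returns one entry per deduped data set, noting every collection
    that lists it."""
    index = {}
    for name, datasets in collections.items():
        for title, url in datasets:
            entry = index.setdefault(
                normalize(title), {"title": title, "listed_in": [], "urls": []}
            )
            entry["listed_in"].append(name)
            entry["urls"].append(url)
    return index

index = build_index({
    "KDnuggets": [("Iris Data Set", "http://example.org/a")],
    "UCI": [("iris data-set", "http://example.org/b")],
})
```

Of course, matching on normalized titles is exactly the kind of fragile identity test topic maps were designed to improve on; the sketch only shows where the dedupe decision has to be made.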

Suggested indexes across data set collections?