Archive for the ‘Preservation’ Category

Software Heritage – Universal Software Archive – Indexing/Semantic Challenges

Sunday, July 24th, 2016

Software Heritage

From the homepage:

We collect and preserve software in source code form, because software embodies our technical and scientific knowledge and humanity cannot afford the risk of losing it.

Software is a precious part of our cultural heritage. We curate and make accessible all the software we collect, because only by sharing it we can guarantee its preservation in the very long term.
(emphasis in original)

The project has already collected:

Even though we just got started, we have already ingested in the Software Heritage archive a significant amount of source code, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public, non-fork repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below allow to peek into the archive and its evolution over time.

The charters of the planned working groups:

Extending the archive

Evolving the archive

Connecting the archive

Using the archive

on quick review did not seem to me to address the indexing/semantic challenges that searching such an archive will pose.

If you are familiar with the differences in metacharacters between different Unix programs, that is only a taste of the differences that will be faced when searching such an archive.

Looking forward to learning more about this project!

What is Scholarly HTML?

Saturday, October 31st, 2015

What is Scholarly HTML? by Robin Berjon and Sébastien Ballesteros.


Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.

The abstract is accurate enough but the “Motivation” section provides a better sense of this project:

Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.

To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.

Our high-level goals were:

  • Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
  • Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
  • Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
  • Entirely built on top of open, royalty-free standards.
  • Long-term viability as a data format.

Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.

An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.

Where semantic modelling is concerned, our approach is to stick as much as possible to Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.

In a small, circumscribed number of cases we have had to depart from, using the (prefixed with sa:) vocabulary instead (Science.AI, 2015a). Our goal is to work with in order to extend their vocabulary, and we will align our usage with the outcome of these discussions.

I especially enjoyed the observation:

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

I don’t doubt the truth of that story but after all, a large number of people are interested in baking cupcakes. Not more than three in many cases, are interested in reading any particular academic paper.

The use of will provide advantages for common concepts but to be truly useful for scholarly writing, it will require serious extension.

Take for example my post yesterday Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]. What microdata from would help readers find Propositionalisation and Aggregates, 2001, which describes substantially the same technique, without claims of surpassing human intuition? (Uncited by the authors the paper on deep feature synthesis.)

Or the 161 papers on propositionalisation that you can find at CiteSeer?

A crude classification that can be used by search engines is very useful but falls far short of the mark in terms of finding and retrieving scholarly writing.

Semantic uniformity for classifying scholarly content hasn’t been reached by scholars or librarians despite centuries of effort. Rather than taking up that Sisyphean task, let’s map across the ever increasing universe of semantic diversity.

What Should Remain After Retraction?

Thursday, April 30th, 2015

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this?

Here is what you get instead of the front page:


A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then current scientific context. Prior to retraction such an article is part of the overall scientific context in a field. Editing that context post-publication, is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places author who cited the retracted publication in an untenable position. Their citations of a retracted work are suspect with no opportunity to defend their citations.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

On Excess: Susan Sontag’s Born-Digital Archive

Tuesday, October 28th, 2014

On Excess: Susan Sontag’s Born-Digital Archive by Jeremy Schmidt & Jacquelyn Ardam.

From the post:

In the case of the Sontag materials, the end result of Deep Freeze and a series of other processing procedures is a single IBM laptop, which researchers can request at the Special Collections desk at UCLA’s Research Library. That laptop has some funky features. You can’t read its content from home, even with a VPN, because the files aren’t online. You can’t live-Tweet your research progress from the laptop — or access the internet at all — because the machine’s connectivity features have been disabled. You can’t copy Annie Leibovitz’s first-ever email — “Mat and I just wanted to let you know we really are working at this. See you at dinner. xxxxxannie” (subject line: “My first Email”) — onto your thumb drive because the USB port is locked. And, clearly, you can’t save a new document, even if your desire to type yourself into recent intellectual history is formidable. Every time it logs out or reboots, the laptop goes back to ground zero. The folders you’ve opened slam shut. The files you’ve explored don’t change their “Last Accessed” dates. The notes you’ve typed disappear. It’s like you were never there.

Despite these measures, real limitations to our ability to harness digital archives remain. The born-digital portion of the Sontag collection was donated as a pair of external hard drives, and that portion is composed of documents that began their lives electronically and in most cases exist only in digital form. While preparing those digital files for use, UCLA archivists accidentally allowed certain dates to refresh while the materials were in “thaw” mode; the metadata then had to be painstakingly un-revised. More problematically, a significant number of files open as unreadable strings of symbols because the software with which they were created is long out of date. Even the fully accessible materials, meanwhile, exist in so many versions that the hapless researcher not trained in computer forensics is quickly overwhelmed.

No one would dispute the need for an authoritative copy of Sontag‘s archive, or at least as close to authoritative as humanly possible. The heavily protected laptop makes sense to me, assuming that the archive considers that to be the authoritative copy.

What has me puzzled, particularly since there are binary formats not recognized in the archive, is why isn’t a non-authoritative copy of the archive online. Any number of people may still possess the software necessary to read the files and/or be able to decrypt the file formats. That would be a net gain to the archive if recovery could be practiced on a non-authoritative copy. They may well encounter such files in the future.

After searching the Online Archive of California, I did encounter Finding Aid for the Susan Sontag papers, ca. 1939-2004 which reports:

Restrictions Property rights to the physical object belong to the UCLA Library, Department of Special Collections. Literary rights, including copyright, are retained by the creators and their heirs. It is the responsibility of the researcher to determine who holds the copyright and pursue the copyright owner or his or her heir for permission to publish where The UC Regents do not hold the copyright.

Availability Open for research, with following exceptions: Boxes 136 and 137 of journals are restricted until 25 years after Susan Sontag’s death (December 28, 2029), though the journals may become available once they are published.

Unfortunately, this finding aid does not mention Sontag’s computer or the transfer of the files to a laptop. A search of Melvyl (library catalog) finds only one archival collection and that is the one mentioned above.

I have written to the special collections library for clarification and will update this post when an answer arrives.

I mention this collection because of Sontag’s importance for a generation and because digital archives will soon be the majority of cases. One hopes the standard practice will be to donate all rights to an archival repository to insure its availability to future generations of scholars.

Bioinformatics tools extracted from a typical mammalian genome project

Monday, October 6th, 2014

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must read if you are interested in useful preservation of research and data. The paper focuses on needed improvements in bioinformatics but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

The Books of Remarkable Women

Monday, March 10th, 2014

The Books of Remarkable Women by Sarah J. Biggs.

From the post:

In 2011, when we blogged about the Shaftesbury Psalter (which may have belonged to Adeliza of Louvain; see below), we wrote that medieval manuscripts which had belonged to women were relatively rare survivals. This still remains true, but as we have reviewed our blog over the past few years, it has become clear that we must emphasize the relative nature of the rarity – we have posted literally dozens of times about manuscripts that were produced for, owned, or created by a number of medieval women.

A good example of why I think topic maps have so much to offer for preservation of cultural legacy.

While each of the books covered in this post are important historical artifacts, their value is enhanced by the context of their production, ownership, contemporary practices, etc.

All of which lies outside the books proper. Just as data about data, the so-called “metadata,” usually lies outside its information artifact.

If future generations are going to have better historical context than we do for many items, we had best get started writing them.

…Digital Asset Sustainability…

Thursday, January 16th, 2014

A National Agenda Bibliography for Digital Asset Sustainability and Preservation Cost Modeling by Butch Lazorchak.

From the post:

The 2014 National Digital Stewardship Agenda, released in July 2013, is still a must-read (have you read it yet?). It integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development.

The Agenda suggests a number of important research areas for the digital stewardship community to consider, but the need for more coordinated applied research in cost modeling and sustainability is high on the list of areas prime for research and scholarship.

The section in the Agenda on “Applied Research for Cost Modeling and Audit Modeling” suggests some areas for exploration:

“Currently there are limited models for cost estimation for ongoing storage of digital content; cost estimation models need to be robust and flexible. Furthermore, as discussed below…there are virtually no models available to systematically and reliably predict the future value of preserved content. Different approaches to cost estimation should be explored and compared to existing models with emphasis on reproducibility of results. The development of a cost calculator would benefit organizations in making estimates of the long‐term storage costs for their digital content.”

In June of 2012 I put together a bibliography of resources touching on the economic sustainability of digital resources. I’m pleasantly surprised as all the new work that’s been done in the meantime, but as the Agenda suggests, there’s more room for directed research in this area. Or perhaps, as Paul Wheatley suggests in this blog post, what’s really needed are coordinated responses to sustainability challenges that build directly on this rich body of work, and that effectively communicate the results out to a wide audience.

I’ve updated the bibliography, hoping that researchers and funders will explore the existing body of projects, approaches and research, note the gaps in coverage suggested by the Agenda and make efforts to address the gaps in the near future through new research or funding.

I count some seventy-one (71) items in this bibliography.

Digital preservation is an area where topic maps can help maintain access over changing customs and vocabularies, but just like migrating from one form of media to another, it doesn’t happen by itself.

Nor is there any “free lunch,” because the data is culturally important, rare, etc. Someone has to pay the bill for it being preserved.

Having the cost of semantic access included in digital preservation would not hurt the cause of topic maps.


Hot Topics: The DuraSpace Community Webinar Series

Thursday, November 7th, 2013

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace andFedora but I wasn’t familiar with the “Hot Topics” webinar series. I was following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats, when I encountered the “Hot Topics” page.

  • Series Six: Research Data in Repositories
  • Series Five: VIVO–Research Discovery and Networking
  • Series Four: Research Data Management Support
  • Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
  • Series Two: Managing and Preserving Audio and Video in your Digital Repository
  • Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But, in ten to fifteen years when GPU techniques are like COBOL is now, good data curation will enable future students to access those techniques.

I think that is worthwhile.


Create and Manage Data: Training Resources

Thursday, May 2nd, 2013

Create and Manage Data: Training Resources

From the webpage:

Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.

The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.

They  have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).

The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:

  • sharing data – why and how
  • data management planning for researchers and research centres
  • documenting data
  • formatting data
  • storing data, including data security, data transfer, encryption, and file sharing
  • ethics and consent
  • data copyright

Each section contains:

  • introductory powerpoint(s)
  • presenter’s guide – where necessary
  • exercises and introduction to exercises
  • quizzes
  • answers

The materials are presented as used in our own training courses  and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant example, for example by discipline, country, relevant laws and regulations.

You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.

Download all resources.

Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.

What you don’t learn here you will pickup from interaction with people actively engaged with research data.

But it will give you a head start on understanding the research data community.

Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.

Ultimate library challenge: taming the internet

Saturday, April 6th, 2013

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s ”digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

”Stuff out there on the web is ephemeral,” said Lucie Burgess the library’s head of content strategy. ”The average life of a web page is only 75 days, because websites change, the contents get taken down.

”If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post or, Click to save the nations digital memory (British Library press release), or 100 websites: Capturing the digital universe (sample of results of archiving with only 100 sites).

The content gathered by the project will be made available to the public.

A welcome venture, particularly since the results will be made available to the public.

An unanswerable question but I do wonder how we would view Greek drama if all of it had been preserved?

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If whole scale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

Making It Happen:…

Friday, March 22nd, 2013

Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.

Great set of overview slides on why research data should be preserved.

Not to mention making the case that semantic diversity, in systems for capturing research data, between researchers, etc., needs to be addressed by any proffered solution.

If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.

As of today, I am getting one hundred and forty (140) presentations.

All of which you will find useful on a variety of data related topics.

….Comparing Digital Preservation Glossaries [Why Do We Need Common Vocabularies?]

Friday, August 10th, 2012

From AIP to Zettabyte: Comparing Digital Preservation Glossaries

Emily Reynolds (2012 Junior Fellow) writes:

As we mentioned in our introductory post last month, the OSI Junior Fellows are working on a project involving a draft digital preservation policy framework. One component of our work is revising a glossary that accompanies the framework. We’ve spent the last two weeks poring through more than two dozen glossaries relating to digital preservation concepts to locate and refine definitions to fit the terms used in the document.

We looked at dictionaries from well-established archival entities like the Society of American Archivists, as well as more strictly technical organizations like the Internet Engineering Task Force. While some glossaries take a traditional archival approach, others were more technical; we consulted documents primarily focusing on electronic records, archives, digital storage and other relevant fields. Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary. Based on what we found, that vocabulary will have to be broadly drawn and flexible to meet different kinds of requirements.

OSI = Office of Strategic Initiatives (Library of Congress)

Not to be overly critical, but I stumble over:

Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary.

Why does a “variety in the definitions for other terms…highlight[s] the need for a common vocabulary?”

I take it as a given that we have diverse vocabularies.

And that attempts at “common” vocabularies succeed in creating yet another “diverse” vocabulary.

So, why would anyone looking at “diverse” vocabularies jump to the conclusion that a “common” vocabulary is required?

Perhaps what is missing is the definition of the problem presented by “diverse” vocabularies.

Hard to solve a problem if you don’t know it is. (Hasn’t stopped some people that I know but that is a story for another day.)

I put it to you (and in your absence I will answer, so answer quickly):

What is the problem (or problems) presented by diverse vocabularies? (Feel free to use examples.)

Or if you prefer, Why do we need common vocabularies?

Digging into Data Challenge

Thursday, January 5th, 2012

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

Interested to hear your comments on the projects as I am sure the projects would as well.

Lots of Copies – Keeps Stuff Safe / Is Insecure – LOCKSS / LOCII

Saturday, January 1st, 2011

Is your enterprise accidentally practicing Lots of Copies Keeps Stuff Safe – LOCKSS?

Gartner analyst Drue Reeves says:

Use document management to make sure you don’t have copies everywhere, and purge nonrelevant material.

If you fall into the lots of copies category your slogan should be: Lots of Copies Is Insecure or LOCII (pronounced “lossee”).

Not all document preservations solutions depend upon being insecure.

Topic maps can help develop strategies to make your document management solution less LOCII.

One way they can help is by mapping out all the duplicate copies. Are they really necessary?

Another way they can help is by showing who has access to each of those copies.

If you trust someone with access, that means you trust everyone they trust.

Check their Facebook or Linkedin pages to see how many other people you are trusting, just by trusting the first person.

Ask yourself: How bad would a Wikileaks like disclosure be?

Then get serious about information security and topic maps.

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval)

Saturday, October 23rd, 2010

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval).

From the website:

CASPAR methodological and technological solution:

  • is compliant to the OAIS Reference Model – the main standard of reference in digital preservation
  • is technology-neutral: the preservation environment could be implemented using any kind of emerging technology
  • adopts a distributed, asynchronous, loosely coupled architecture and each key component is self-contained and portable: it may be deployed without dependencies on different platform and framework
  • is domain independent: it could be applied with low additional effort to multiple domains/contexts.
  • preserves knowledge and intelligibility, not just the “bits”
  • guarantees the integrity and identity of the information preserved as well as the protection of digital rights

FYI: OAIS Reference Model

As a librarian, you will be confronted with claims similar to these in vendor literature, grant applications and other marketing materials.


  1. Pick one of these claims. What documentation/software produced by the project would you review to evaluate the claim you have chosen?
  2. What other materials do you think would be relevant to your review?
  3. Perform the actual review (10 – 15 pages, with citations, project)