Archive for the ‘Software Preservation’ Category

Software Heritage – Universal Software Archive – Indexing/Semantic Challenges

Sunday, July 24th, 2016

Software Heritage

From the homepage:

We collect and preserve software in source code form, because software embodies our technical and scientific knowledge and humanity cannot afford the risk of losing it.

Software is a precious part of our cultural heritage. We curate and make accessible all the software we collect, because only by sharing it we can guarantee its preservation in the very long term.
(emphasis in original)

The project has already collected:

Even though we just got started, we have already ingested in the Software Heritage archive a significant amount of source code, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public, non-fork repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below allow to peek into the archive and its evolution over time.

The charters of the planned working groups:

Extending the archive

Evolving the archive

Connecting the archive

Using the archive

did not, on a quick review, seem to me to address the indexing/semantic challenges that searching such an archive will pose.

If you are familiar with the differences in metacharacters between different Unix programs, that is only a taste of the differences that will be faced when searching such an archive.
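To make the metacharacter point concrete, here is a minimal sketch in Python that reads one pattern two ways: as a shell-style glob and as a regular expression. The pattern and the file names are made up for illustration and have nothing to do with the Software Heritage archive or its tooling.

```python
import fnmatch
import re

# Hypothetical pattern and names, chosen only to show the divergence.
pattern = "a*.c"
names = ["a.c", "ab.c", "aaa.c", "aXc", ".c", "c"]

# Glob reading: '*' matches any run of characters, '.' is a literal dot.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, pattern)]

# Regex reading: '*' repeats the preceding 'a', '.' matches any one character.
regex_hits = [n for n in names if re.fullmatch(pattern, n)]

print("glob :", glob_hits)
print("regex:", regex_hits)
```

The two readings agree on some names and disagree on others, and that is with only two tools and one short pattern. An archive spanning decades of languages, build systems and naming habits multiplies that kind of mismatch.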

Looking forward to learning more about this project!

Data and Software Preservation for Open Science (DASPOS)

Monday, December 17th, 2012

I first read about this in Preserving Science Data and Software for Open Science:

One of the emerging, and soon to be defining, characteristics of science research is the collection, usage and storage of immense amounts of data. In fields as diverse as medicine, astronomy and economics, large data sets are becoming the foundation for new scientific advances. A new project led by University of Notre Dame researchers will explore solutions to the problems of preserving data, analysis software and computational work flows, and how these relate to results obtained from the analysis of large data sets.

Titled “Data and Software Preservation for Open Science (DASPOS),” the National Science Foundation-funded $1.8 million program is focused on high energy physics data from the Large Hadron Collider (LHC) and the Fermilab Tevatron.

The research group, which is led by Mike Hildreth, a professor of physics; Jarek Nabrzyski, director of the Center for Research Computing with a concurrent appointment as associate professor of computer science and engineering; and Douglas Thain, associate professor of computer science and engineering, also will survey and incorporate the preservation needs of other research communities, such as astrophysics and bioinformatics, where large data sets and the derived results are becoming the core of emerging science in these disciplines.

Preservation of data and software semantics. Sounds like topic maps!

Materials you may find useful:

Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics (May 2012. The listing omits the last 40 authors, so I am omitting the first 50 as well. See the paper for the complete list.)

Data Preservation in High Energy Physics (December 2009, forerunner to the 2012 report)

DASPOS: Common Formats? by Mike Hildreth (slides, 19 November 2012)

DASPOS Overview by Mike Hildreth (slides, 20 November 2012)

Perhaps the most important statement from the 20 November slides:

A “scouting party”: push forward in what looks like a good direction without worrying about full world-wide consensus

I have participated in, seen, or read about any number of projects and, well, this is quite refreshing.

Starting a project with final answers, or developing them prematurely, is a guarantee of poor results.

Both science and the humanities explore to find answers. Why should developing standards be any different?

A great deal to be learned here, even if you are just listening in on the conversations.