Archive for the ‘Archives’ Category

Don’t Delete Evil Data [But Remember the Downside of “Evidence”]

Wednesday, February 14th, 2018

Don’t Delete Evil Data by Lam Thuy Vo.

From the post:

The web needs to be a friendlier place. It needs to be more truthful, less fake. It definitely needs to be less hateful. Most people agree with these notions.

There have been a number of efforts recently to enforce this idea: the Facebook groups and pages operated by Russian actors during the 2016 election have been deleted. None of the Twitter accounts listed in connection to the investigation of the Russian interference with the last presidential election are online anymore. Reddit announced late last fall that it was banning Nazi, white supremacist, and other hate groups.

But even though much harm has been done on these platforms, is the right course of action to erase all these interactions without a trace? So much of what constitutes our information universe is captured online—if foreign actors are manipulating political information we receive and if trolls turn our online existence into hell, there is a case to be made for us to be able to trace back malicious information to its source, rather than simply removing it from public view.

In other words, there is a case to be made to preserve some of this information, to archive it, structure it, and make it accessible to the public. It’s unreasonable to expect social media companies to sidestep consumer privacy protections and to release data attached to online misconduct willy-nilly. But to stop abuse, we need to understand it. We should consider archiving malicious content and related data in responsible ways that allow for researchers, sociologists, and journalists to understand its mechanisms better and, potentially, to demand more accountability from trolls whose actions may forever be deleted without a trace.

By some as-yet-unspecified mechanism, I would support preservation of all social media, and would have it publicly available if it was publicly posted originally. Any restriction on, or permission required for, seeing/using the data will lead to the same abuses we see now.

Twitter, among others, talks about abuse but no one can prove or disprove whatever Twitter cares to say.

There is a downside to preserving social media. You have probably seen the NBC News story on 200,000 tweets billed as the smoking gun on Russian interference with the 2016 elections.

Well, except that if you look at the tweets, that’s about as far from a smoking gun on Russian interference as anything you can imagine.

By analogy, that’s why intelligence analysts always say they have evidence and give you their conclusions, but not the evidence itself. There is too much danger you will discover their report is completely fictional.

Or when not wholly fictional, serves their or their agency’s interest.

Keeping evidence is risky business. Just so you are aware.

If You See Something, Save Something (Poke A Censor In The Eye)

Thursday, August 17th, 2017

If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine by Alexis Rossi.

From the post:

In recent days many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web.

There are several ways to save pages and whole sites so that they appear in the Wayback Machine. Here are 6 of them.

In the comments, Ellen Spertus mentions a 7th way: Donate to the Internet Archive!

It’s the age of censorship: by governments, the DMCA, the EU (right to be forgotten), Facebook, Google, Twitter and others.

Poke a censor in the eye, see something, save something to the Wayback Machine.

The Wayback Machine can’t stop all censorship, so save local and remote copies as well.
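Beyond the six ways in the post, the Wayback Machine also answers plain HTTP requests: fetching https://web.archive.org/save/&lt;url&gt; triggers a capture, and the availability API reports the closest existing snapshot. A minimal Python sketch of building those request URLs (error handling, rate limits, and parsing of the availability JSON are left out):

```python
import urllib.parse

WAYBACK_SAVE = "https://web.archive.org/save/"
WAYBACK_AVAILABLE = "https://archive.org/wayback/available?url="

def save_url(page_url: str) -> str:
    """Return the 'Save Page Now' endpoint for a page.
    Requesting this URL asks the Wayback Machine to archive the page."""
    return WAYBACK_SAVE + page_url

def availability_url(page_url: str) -> str:
    """Return the availability-API URL; fetching it yields JSON
    describing the closest archived snapshot of the page, if any."""
    return WAYBACK_AVAILABLE + urllib.parse.quote(page_url, safe="")

# To actually trigger a capture (network access required):
#   import urllib.request
#   urllib.request.urlopen(save_url("https://example.com/page"))
```

For local copies, pair this with whatever fetch-and-store tooling you already use; the point is that saving to the Wayback Machine is a one-request operation.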

Keep poking until all censors go blind.

American Archive of Public Broadcasting

Saturday, June 17th, 2017

American Archive of Public Broadcasting

From the post:

An archive worth knowing about: The Library of Congress and Boston’s WGBH have joined forces to create The American Archive of Public Broadcasting and “preserve for posterity the most significant public television and radio programs of the past 60 years.” Right now, they’re overseeing the digitization of approximately 40,000 hours of programs. And already you can start streaming “more than 7,000 historic public radio and television programs.”

The collection includes local news and public affairs programs, and “programs dealing with education, environmental issues, music, art, literature, dance, poetry, religion, and even filmmaking.” You can browse the complete collection here. Or search the archive here. For more on the archive, read this About page.


Hopefully someone is spinning cable/television content 24 x 7 to archival storage. The ability to research and document, reliably, patterns in shows, advertisements, news reporting, etc., is more important than any speculative copyright interest.

Introducing arxiv-sanity

Sunday, September 18th, 2016

Only a small part of Arxiv appears in arxiv-sanity, but it is enough to show the feasibility of this approach.

What captures my interest is the potential to substitute/extend the program to use other similarity measures.

Bearing in mind that searching is only the first step towards the acquisition and preservation of knowledge.
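As a sketch of what substituting other similarity measures means in practice: arxiv-sanity ranks papers by TF-IDF similarity over their text, and swapping in a different measure only means replacing the vectorizer or the distance function. Below is a minimal bag-of-words cosine similarity in plain Python (raw term frequencies, not arxiv-sanity’s actual TF-IDF pipeline; the sample abstracts are invented):

```python
import math
import re
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector for a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

abstract1 = "topic maps merge subjects across heterogeneous vocabularies"
abstract2 = "merging subjects across vocabularies with topic maps"
abstract3 = "convolutional networks for image classification"

# Abstracts that share vocabulary score higher than unrelated ones.
sim_related = cosine(tf_vector(abstract1), tf_vector(abstract2))
sim_unrelated = cosine(tf_vector(abstract1), tf_vector(abstract3))
```

Replacing `cosine` with, say, Jaccard overlap, or `tf_vector` with an embedding lookup, changes the ranking without touching the rest of the search machinery.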

PS: I first saw this in a tweet by Data Science Renee.

Software Heritage – Universal Software Archive – Indexing/Semantic Challenges

Sunday, July 24th, 2016

Software Heritage

From the homepage:

We collect and preserve software in source code form, because software embodies our technical and scientific knowledge and humanity cannot afford the risk of losing it.

Software is a precious part of our cultural heritage. We curate and make accessible all the software we collect, because only by sharing it we can guarantee its preservation in the very long term.
(emphasis in original)

The project has already collected:

Even though we just got started, we have already ingested in the Software Heritage archive a significant amount of source code, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public, non-fork repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below allow you to peek into the archive and its evolution over time.

The charters of the planned working groups:

  • Extending the archive
  • Evolving the archive
  • Connecting the archive
  • Using the archive

did not, on quick review, seem to me to address the indexing/semantic challenges that searching such an archive will pose.

If you are familiar with the differences in metacharacters between different Unix programs, that is only a taste of the differences that will be faced when searching such an archive.
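The metacharacter point is easy to demonstrate without leaving Python: the same pattern string means different things to a shell-style glob and to a regular expression engine (the standard-library fnmatch module implements glob matching):

```python
import fnmatch
import re

# In shell globbing, "*" alone matches any run of characters; in a
# regular expression, "*" only repeats the preceding atom, and "." is
# the metacharacter that matches any single character.
pattern = "report*.txt"

assert fnmatch.fnmatch("report2018.txt", pattern)        # glob: matches
assert re.fullmatch(pattern, "report2018.txt") is None   # regex: "t*" means zero or more "t"s
assert re.fullmatch(r"report.*\.txt", "report2018.txt")  # regex spelling of the same idea
```

Now multiply that mismatch across every language, build system, and source-control convention in a universal software archive, and the scale of the search problem becomes clear.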

Looking forward to learning more about this project!

Cryptome – Happy 20th Anniversary!

Monday, June 20th, 2016


Cryptome marks 20 years, June 1996-2016, 100K dox thanx to 25K mostly anonymous doxers.

Donate $100 for the Cryptome Archive of 101,900 files from June 1996 to 25 May 2016 on 1 USB  (43.5GB). Cryptome public key.
(Search site with Google, or WikiLeaks for most not all.)

Bitcoin: 1P11b3Xkgagzex3fYusVcJ3ZTVsNwwnrBZ

Additional items are available on the Cryptome site.

There is also an interesting post on fake Cryptome torrents.

$100 is a real bargain for the Cryptome Archive, plus you will be helping a worthy cause.

Repost the news of Cryptome 20th anniversary far and wide!


Making the most of The National Archives Library (webinar 29 March 2016)

Saturday, March 5th, 2016

Making the most of The National Archives Library

From the webpage:

This webinar will help you to make the most of The National Archives’ Library, with published works dating from the 16th century onwards. Among other topics, it will cover what the Library contains, why it is useful to use published sources before accessing archive records and how to access the Library catalogue.

Webinars are online only events.

The Library at The National Archives is holding a series of events to mark National Libraries Day. The National Archives’ Library is a rich resource that is accessible to all researchers.

We run an exciting range of events and exhibitions on a wide variety of topics.

Entrance to The National Archives is free and there is no need to book.


Tuesday, 29 March 2016 from 16:00 to 17:00 (BST)

Assuming that 16:00 to 17:00 GMT was intended, that would be starting at 11 AM EST.

I have pinged the national archive on using BST, British Summer Time, in March. 😉

3 Decades of High Quality News! (code reuse)

Monday, February 1st, 2016

‘NewsHour’ archives to be digitized and available online by Dru Sefton.

From the post:

More than three decades of NewsHour are heading to an online home, the American Archive of Public Broadcasting.

Nearly 10,000 episodes that aired from 1975 to 2007 will be archived through a collaboration among AAPB; WGBH in Boston; WETA in Arlington, Va.; and the Library of Congress. The organizations jointly announced the project Thursday.

“The project will take place over the next year and a half,” WGBH spokesperson Emily Balk said. “The collection should be up in its entirety by mid-2018, but AAPB will be adding content from the collection to its website on an ongoing, monthly basis.”

Looking forward to that collection!

Useful on its own, but even more so if you had an indexical object that could point to a subject in a PBS news episode and, at the same time, point to episodes on the same subject from other TV and radio news archives, not to mention the same subject in newspapers and magazines.

Oh, sorry, that would be a topic in ISO 13250-2 parlance and the more general concept of a proxy in ISO 13250-5. Thought I should mention that before someone at IBM runs off to patent another pre-existing idea.

I don’t suppose padding patent statistics hurts all that much, considering that the Supremes are poised to invalidate process and software patents in one fell swoop.

Hopefully economists will be ready to measure the amount of increased productivity (legal worries about and enforcement of process/software patents aren’t productive activities) from foreclosing even the potential of process or software patents.

Copyright is more than sufficient to protect source code, as if any programmer is going to use another programmer’s code. They say that scientists would rather use another scientist’s toothbrush than his vocabulary.

Copying another programmer’s code (code re-use) is more akin to sharing a condom. It’s just not done.

Teletext Time Travel [Extra Dirty Data]

Sunday, January 17th, 2016

Teletext Time Travel by Russ J. Graham.

From the post:

Transdiffusioner Jason Robertson has a complicated but fun project underway – recovering old teletext data from VHS cassettes.

Previously, it was possible – difficult but possible – to recover teletext from SVHS recordings, but they’re as rare as hen’s teeth as the format never really caught on. The data was captured by ordinary VHS but was never clear enough to get anything but a very few correct characters in amongst a massive amount of nonsense.

Technology is changing that. The continuing boom in processor power means it’s now possible to feed 15 minutes of smudged VHS teletext data into a computer and have it relentlessly compare the pages as they flick by at the top of the picture, choosing to hold characters that are the same on multiple viewing (as they’re likely to be right) and keep trying for clearer information for characters that frequently change (as they’re likely to be wrong).

I mention this so that the next time you complain about your “dirty data,” you remember there is far dirtier data in the world!

Cultural Heritage Markup (Pre-Balisage)

Thursday, June 4th, 2015

Cultural Heritage Markup Balisage, Monday, August 10, 2015.

Do you remember visiting your great-aunt’s house? Where everything looked like museum pieces and the smell was worse than your room ever got? And all the adults had strained smiles and said how happy they were to be there?

Well, cultural heritage markup isn’t like that. We have maiden aunts and Norwegian bachelor uncles to take care of all the real cultural heritage stuff. This pre-Balisage workshop is about working with markup and is a lot more fun!

Hugh Cayless, Duke University, introduces the workshop:

Cultural heritage materials are remarkable for their complexity and heterogeneity. This often means that when you’ve solved one problem, you’ve solved one problem. Arrayed against this difficulty, we have a nice big pile of tools and technologies with an alphabet soup of names like XML, TEI, RDF, OAIS, SIP, DIP, XIP, AIP, and BIBFRAME, coupled with a variety of programming languages or storage and publishing systems. All of our papers today address in some way the question of how you deal with messy, complex, human data using the available toolsets and how those toolsets have to be adapted to cope with our data. How do you avoid having your solution dictated by the tools available? How do you know when you’re doing it right? Our speakers are all trying, in various ways, to reconfigure their tools or push past those tools’ limitations, and they are going to tell us how they’re doing it.

A large number of your emails, tweets, webpages, etc. are destined to be “cultural heritage” (phone calls too, if the NSA has anything to say about it) so you had better get on the cultural heritage markup train today!


OPenn: Primary Digital Resources Available to All through Penn Libraries’ New Online Platform

Friday, May 1st, 2015

OPenn: Primary Digital Resources Available to All through Penn Libraries’ New Online Platform by Jessie Dummer.

From the post:

The Penn Libraries and the Schoenberg Institute for Manuscript Studies are thrilled to announce the launch of OPenn: Primary Resources Available to Everyone, a new website that makes digitized cultural heritage material freely available and accessible to the public. OPenn is a major step in the Libraries’ strategic initiative to embrace open data, with all images and metadata on this site available as free cultural works to be freely studied, applied, copied, or modified by anyone, for any purpose. It is crucial to the mission of SIMS and the Penn Libraries to make these materials of great interest and research value easy to access and reuse. The OPenn team at SIMS has been working towards launching the website for the past year. Director Will Noel’s original idea to make our Medieval and Renaissance manuscripts open to all has grown into a space where the Libraries can collaborate with other institutions who want to open their data to the world.

Images of the manuscripts are currently available on OPenn at full resolution, with derivatives also provided for easy reuse on the web. Downloading, whether several select images or the entire dataset, is easily accomplished by following instructions or recipes posted in the Technical Read Me on OPenn. The website is designed to be machine-readable, but easy for individuals to use, too.

Oh, the manuscripts themselves?

Licensing is a real treat:

All images and their contents from the Lawrence J. Schoenberg Collection are free of known copyright restrictions and in the public domain. See the Creative Commons Public Domain Mark page for more information on terms of use:

Unless otherwise stated, all manuscript descriptions and other cataloging metadata are ©2015 The University of Pennsylvania Libraries. They are licensed for use under a Creative Commons Attribution Licensed version 4.0 (CC-BY-4.0):

For a description of the terms of use, see the Creative Commons Deed:

In substance and licensing, this is quite a departure from academic societies that still consider comping travel and hotel rooms to be “fostering scholarship.” “Ye shall know them by their fruits.” (Matthew 7:16)

What Should Remain After Retraction?

Thursday, April 30th, 2015

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this?

Here is what you get instead of the front page:


A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then-current scientific context. Prior to retraction, such an article is part of the overall scientific context in a field. Editing that context post-publication is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places authors who cited the retracted publication in an untenable position. Their citations of a retracted work are suspect, with no opportunity to defend their citations.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

Galleries, Libraries, Archives, and Museums (GLAM CC Licensing)

Friday, March 6th, 2015

Galleries, Libraries, Archives, and Museums (GLAM CC Licensing)

A very extensive list of galleries, libraries, archives, and museums (GLAM) that are using CC licensing.

A good resource to have at hand if you need to argue for CC licensing with your gallery, library, archive, or museum.

I first saw this in a tweet by Adrianne Russell.

Update: Resource List for March 5 Open Licensing Online Program

Unsustainable Museum Data

Friday, January 30th, 2015

Unsustainable Museum Data by Matthew Lincoln.

From the post:

In which I ask museums to give less API, more KISS and LOCKSS, please.

“How can we ensure our [insert big digital project title here] is sustainable?” So goes the cry from many a nascent digital humanities project, and rightly so! We should be glad that many new ventures are starting out by asking this question, rather than waiting until the last minute to come up with a sustainability plan. But Adam Crymble asks whether an emphasis on web-based digital projects instead of producing and sharing static data files is needlessly worsening our sustainability problem. Rather than allowing users to download the underlying data files (a passel of data tables, or marked-up text files, or even serialized linked data), these web projects mediate those data with user interfaces and guided searching, essentially making the data accessible to the casual user. But serving data piecemeal to users has its drawbacks, notes Crymble. If and when the web server goes down, access to the data disappears:

When something does go wrong we quickly realise it wasn’t the website we needed. It was the data, or it was the functionality. The online element, which we so often see as an asset, has become a liability.

I would broaden the scope of this call to include library and other data as well. Yes, APIs can be very useful but so can a copy of the original data.

Matthew mentions “creative re-use” near the end of his post but I would highlight that as a major reason for providing the original data. No doubt museums and others work very hard at offering good APIs of data but any API is only one way to obtain and view data.

For data, any data, to reach its potential, it needs to be available for multiple views of the same data. Some you may think are better, some you may think are worse than the original. But it is the potential for a multiplicity of views that opens up those possibilities. Keeping data behind an API is an act of preventing data from reaching its potential.
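One way to square APIs with KISS and LOCKSS is to flatten the API into a static dump once, either by the provider or by a user while the service is still up. A toy sketch of that walk, with a stubbed `fetch_page` standing in for any real paged museum API (the stub and its sample records are invented for illustration):

```python
import json

def fetch_page(page: int) -> list[dict]:
    """Placeholder for a museum's paged API; returns [] past the last page.
    A real implementation would issue an HTTP request here."""
    data = {1: [{"id": 1, "title": "Vase"}], 2: [{"id": 2, "title": "Coin"}]}
    return data.get(page, [])

def dump_all(path: str) -> int:
    """Walk every page of the API once and write a flat newline-delimited
    JSON file: the static artifact users can keep, copy, and re-use."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        page = 1
        while True:
            records = fetch_page(page)
            if not records:
                break
            for rec in records:
                out.write(json.dumps(rec) + "\n")
                count += 1
            page += 1
    return count
```

The resulting file survives the web server that produced it, which is exactly the sustainability property Crymble is asking for.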

The Cobweb: Can the Internet be archived?

Monday, January 26th, 2015

The Cobweb: Can the Internet be archived? by Jill Lepore.

From the post:

Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T. At 2:50 P.M. G.M.T., Igor Girkin, a Ukrainian separatist leader also known as Strelkov, or someone acting on his behalf, posted a message on VKontakte, a Russian social-media site: “We just downed a plane, an AN-26.” (An Antonov 26 is a Soviet-built military cargo plane.) The post includes links to video of the wreckage of a plane; it appears to be a Boeing 777.

Two weeks before the crash, Anatol Shmelev, the curator of the Russia and Eurasia collection at the Hoover Institution, at Stanford, had submitted to the Internet Archive, a nonprofit library in California, a list of Ukrainian and Russian Web sites and blogs that ought to be recorded as part of the archive’s Ukraine Conflict collection. Shmelev is one of about a thousand librarians and archivists around the world who identify possible acquisitions for the Internet Archive’s subject collections, which are stored in its Wayback Machine, in San Francisco. Strelkov’s VKontakte page was on Shmelev’s list. “Strelkov is the field commander in Slaviansk and one of the most important figures in the conflict,” Shmelev had written in an e-mail to the Internet Archive on July 1st, and his page “deserves to be recorded twice a day.”

On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane. Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message “Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.” By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.

If you aren’t a daily user of the Internet Archive (home of the Wayback Machine) you are missing out on a very useful resource.

Jill tells the story about the archive, its origins and challenges as well as I have heard it told. Very much worth your time to read.

Hopefully after reading the story you will find ways to contribute/support the Internet Archive.

Without the Internet Archive, the memory of the web would be distributed, isolated and in peril of erasure and neglect.

I am sure many governments and corporations wish the memory of the web could be altered, let’s disappoint them!

American Institute of Physics: Oral Histories

Monday, December 15th, 2014

American Institute of Physics: Oral Histories

From the webpage:

The Niels Bohr Library & Archives holds a collection of over 1,500 oral history interviews. These range in date from the early 1960s to the present and cover the major areas and discoveries of physics from the past 100 years. The interviews are conducted by members of the staff of the AIP Center for History of Physics as well as other historians and offer unique insights into the lives, work, and personalities of modern physicists.

Read digitized oral history transcripts online

I don’t have a large audio data-set (see: Shining a light into the BBC Radio archives) but there are lots of other people who do.

If you are teaching or researching physics for the last 100 years, this is a resource you should not miss.

Integrating audio resources such as this one, at less than the full recording level (think of it as audio transclusion), into teaching materials would be a great step forward. To say nothing of being able to incorporate such granular resources into a library catalog.

I did not find an interview with Edward Teller but a search of the transcripts turned up three hundred and five (305) “hits” where he is mentioned in interviews. A search for J. Robert Oppenheimer netted four hundred and thirty-six (436) results.

If you know your atomic bomb history, you can guess between Teller and Oppenheimer which one would support the “necessity” defense for the use of torture. It would be an interesting study to see how the interviewees saw these two very different men.

Shining a light into the BBC Radio archives

Monday, December 15th, 2014

Shining a light into the BBC Radio archives by Yves Raimond, Matt Hynes, and Rob Cooper.

From the post:


One of the biggest challenges for the BBC Archive is how to open up our enormous collection of radio programmes. As we’ve been broadcasting since 1922 we’ve got an archive of almost 100 years of audio recordings, representing a unique cultural and historical resource.

But the big problem is how to make it searchable. Many of the programmes have little or no meta-data, and the whole collection is far too large to process through human efforts alone.

Help is at hand. Over the last five years or so, technologies such as automated speech recognition, speaker identification and automated tagging have reached a level of accuracy where we can start to get impressive results for the right type of audio. By automatically analysing sound files and making informed decisions about the content and speakers, these tools can effectively help to fill in the missing gaps in our archive’s meta-data.

The Kiwi set of speech processing algorithms

COMMA is built on a set of speech processing algorithms called Kiwi. Back in 2011, BBC R&D were given access to a very large speech radio archive, the BBC World Service archive, which at the time had very little meta-data. In order to build our prototype around this archive we developed a number of speech processing algorithms, reusing open-source building blocks where possible. We then built the following workflow out of these algorithms:

  • Speaker segmentation, identification and gender detection (using the LIUM diarization toolkit, diarize-jruby and ruby-lsh). This process is also known as diarisation. Essentially an audio file is automatically divided into segments according to the identity of the speaker. The algorithm can show us who is speaking and at what point in the sound clip.
  • Speech-to-text for the detected speech segments (using CMU Sphinx). At this point the spoken audio is translated as accurately as possible into readable text. This algorithm uses models built from a wide range of BBC data.
  • Automated tagging with DBpedia identifiers. DBpedia is a large database holding structured data extracted from Wikipedia. The automatic tagging process creates the searchable meta-data that ultimately allows us to access the archives much more easily. This process uses a tool we developed called ‘Mango’.


COMMA is due to launch some time in April 2015. If you’d like to be kept informed of our progress you can sign up for occasional email updates here. We’re also looking for early adopters to test the platform, so please contact us if you’re a cultural institution, media company or business that has large audio data-set you want to make searchable.

This article was written by Yves Raimond (lead engineer, BBC R&D), Matt Hynes (senior software engineer, BBC R&D) and Rob Cooper (development producer, BBC R&D)

I don’t have a large audio data-set but I am certainly going to be following this project. The results should be useful in and of themselves, to say nothing of being a good starting point for further tagging. I wonder if the BBC Sanskrit broadcasts are going to be available? I will have to check on that.
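The three-stage workflow described in the post (diarisation, then speech-to-text, then tagging) amounts to a pipeline of transforms over speaker segments. The sketch below uses hypothetical placeholder functions, not the BBC’s actual Kiwi code; each function body stands in for a real toolkit:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # seconds into the recording
    end: float
    speaker: str          # diarisation output: anonymous speaker label
    transcript: str = ""  # filled in by speech-to-text
    tags: tuple = ()      # filled in by entity tagging

def diarise(audio_id: str) -> list[Segment]:
    """Stage 1 placeholder: split audio into per-speaker segments.
    A real system would call a diarisation toolkit here."""
    return [Segment(0.0, 12.5, "S1"), Segment(12.5, 30.0, "S2")]

def transcribe(seg: Segment) -> Segment:
    """Stage 2 placeholder: speech-to-text for one segment."""
    seg.transcript = f"(transcript of {seg.speaker}, {seg.start}-{seg.end}s)"
    return seg

def tag(seg: Segment) -> Segment:
    """Stage 3 placeholder: link transcript terms to DBpedia-style IDs."""
    seg.tags = ("dbpedia:Example",)
    return seg

def process(audio_id: str) -> list[Segment]:
    """Run the full pipeline over one recording."""
    return [tag(transcribe(s)) for s in diarise(audio_id)]
```

The interesting design point is that each stage enriches the same segment records, so the searchable meta-data accumulates without any stage needing to know about the others.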

Without diminishing the achievements of other institutions, the efforts of the BBC, the British Library, and the British Museum are truly remarkable.

I first saw this in a tweet by Mike Jones.

Treasury Island: the film

Tuesday, November 25th, 2014

Treasury Island: the film by Lauren Willmott, Boyce Keay, and Beth Morrison.

From the post:

We are always looking to make the records we hold as accessible as possible, particularly those which you cannot search for by keyword in our catalogue, Discovery. And we are experimenting with new ways to do it.

The Treasury series, T1, is a great example of a series which holds a rich source of information but is complicated to search. T1 covers a wealth of subjects (from epidemics to horses) but people may overlook it as most of it is only described in Discovery as a range of numbers, meaning it can be difficult to search if you don’t know how to look. There are different processes for different periods dating back to 1557 so we chose to focus on records after 1852. Accessing these records requires various finding aids and multiple stages to access the papers. It’s a tricky process to explain in words so we thought we’d try demonstrating it.

We wanted to show people how to access these hidden treasures, by providing a visual aid that would work in conjunction with our written research guide. Armed with a tablet and a script, we got to work creating a video.

Our remit was:

  • to produce a video guide no more than four minutes long
  • to improve accessibility to these records through a simple, step-by-step process
  • to highlight what the finding aids and documents actually look like

These records can be useful to a whole range of researchers, from local historians to military historians to social historians, given that virtually every area of government action involved the Treasury at some stage. We hope this new video, which we intend to be watched in conjunction with the written research guide, will also be of use to any researchers who are new to the Treasury records.

Adding video guides to our written research guides are a new venture for us and so we are very keen to hear your feedback. Did you find it useful? Do you like the film format? Do you have any suggestions or improvements? Let us know by leaving a comment below!

This is a great illustration that data management isn’t something new. The Treasury Board has kept records since 1557 and has accumulated a rather extensive set of materials.

The written research guide looks interesting but since I am very unlikely to ever research Treasury Board records, I am unlikely to need it.

However, the authors have anticipated that someone might be interested in process of record keeping itself and so provided this additional reference:

Thomas L Heath, The Treasury (The Whitehall Series, 1927, GP Putnam’s Sons Ltd, London and New York)

That would be an interesting find!

I first saw this in a tweet by Andrew Janes.

On Excess: Susan Sontag’s Born-Digital Archive

Tuesday, October 28th, 2014

On Excess: Susan Sontag’s Born-Digital Archive by Jeremy Schmidt & Jacquelyn Ardam.

From the post:

In the case of the Sontag materials, the end result of Deep Freeze and a series of other processing procedures is a single IBM laptop, which researchers can request at the Special Collections desk at UCLA’s Research Library. That laptop has some funky features. You can’t read its content from home, even with a VPN, because the files aren’t online. You can’t live-Tweet your research progress from the laptop — or access the internet at all — because the machine’s connectivity features have been disabled. You can’t copy Annie Leibovitz’s first-ever email — “Mat and I just wanted to let you know we really are working at this. See you at dinner. xxxxxannie” (subject line: “My first Email”) — onto your thumb drive because the USB port is locked. And, clearly, you can’t save a new document, even if your desire to type yourself into recent intellectual history is formidable. Every time it logs out or reboots, the laptop goes back to ground zero. The folders you’ve opened slam shut. The files you’ve explored don’t change their “Last Accessed” dates. The notes you’ve typed disappear. It’s like you were never there.

Despite these measures, real limitations to our ability to harness digital archives remain. The born-digital portion of the Sontag collection was donated as a pair of external hard drives, and that portion is composed of documents that began their lives electronically and in most cases exist only in digital form. While preparing those digital files for use, UCLA archivists accidentally allowed certain dates to refresh while the materials were in “thaw” mode; the metadata then had to be painstakingly un-revised. More problematically, a significant number of files open as unreadable strings of symbols because the software with which they were created is long out of date. Even the fully accessible materials, meanwhile, exist in so many versions that the hapless researcher not trained in computer forensics is quickly overwhelmed.

No one would dispute the need for an authoritative copy of Sontag's archive, or at least as close to authoritative as humanly possible. The heavily protected laptop makes sense to me, assuming that the archive considers that to be the authoritative copy.

What has me puzzled, particularly since the archive contains binary formats it cannot read, is why a non-authoritative copy of the archive isn't online. Any number of people may still possess the software necessary to read the files and/or be able to decode the file formats. Recovery practiced on a non-authoritative copy would be a net gain to the archive, which may well encounter such formats again in the future.

After searching the Online Archive of California, I did encounter Finding Aid for the Susan Sontag papers, ca. 1939-2004 which reports:

Restrictions Property rights to the physical object belong to the UCLA Library, Department of Special Collections. Literary rights, including copyright, are retained by the creators and their heirs. It is the responsibility of the researcher to determine who holds the copyright and pursue the copyright owner or his or her heir for permission to publish where The UC Regents do not hold the copyright.

Availability Open for research, with following exceptions: Boxes 136 and 137 of journals are restricted until 25 years after Susan Sontag’s death (December 28, 2029), though the journals may become available once they are published.

Unfortunately, this finding aid does not mention Sontag’s computer or the transfer of the files to a laptop. A search of Melvyl (library catalog) finds only one archival collection and that is the one mentioned above.

I have written to the special collections library for clarification and will update this post when an answer arrives.

I mention this collection because of Sontag’s importance for a generation and because digital archives will soon be the majority of cases. One hopes the standard practice will be to donate all rights to an archival repository to ensure its availability to future generations of scholars.

Bioinformatics tools extracted from a typical mammalian genome project

Monday, October 6th, 2014

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must read if you are interested in useful preservation of research and data. The paper focuses on needed improvements in bioinformatics but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

Hello Again

Friday, May 30th, 2014

We Are Now In Command of the ISEE-3 Spacecraft by Keith Cowing.

From the post:

The ISEE-3 Reboot Project is pleased to announce that our team has established two-way communication with the ISEE-3 spacecraft and has begun commanding it to perform specific functions. Over the coming days and weeks our team will make an assessment of the spacecraft’s overall health and refine the techniques required to fire its engines and bring it back to an orbit near Earth.

First Contact with ISEE-3 was achieved at the Arecibo Radio Observatory in Puerto Rico. We would not have been able to achieve this effort without the gracious assistance provided by the entire staff at Arecibo. In addition to the staff at Arecibo, our team included simultaneous listening and analysis support by AMSAT-DL at the Bochum Observatory in Germany, the Space Science Center at Morehead State University in Kentucky, and the SETI Institute’s Allen Telescope Array in California.

How’s that for engineering and documentation?

So, maybe good documentation isn’t such a weird thing after all. 😉

Nomad and Historic Information

Thursday, May 22nd, 2014

You may remember Nomad from the Star Trek episode The Changeling. Not quite on that scale but NASA has signed an agreement to allow citizen scientists to “wake up” a thirty-five (35) year old spacecraft this next August.

NASA has given a green light to a group of citizen scientists attempting to breathe new scientific life into a more than 35-year old agency spacecraft.

The agency has signed a Non-Reimbursable Space Act Agreement (NRSAA) with Skycorp, Inc., in Los Gatos, California, allowing the company to attempt to contact, and possibly command and control, NASA’s International Sun-Earth Explorer-3 (ISEE-3) spacecraft as part of the company’s ISEE-3 Reboot Project. This is the first time NASA has worked such an agreement for use of a spacecraft the agency is no longer using or ever planned to use again.

The NRSAA details the technical, safety, legal and proprietary issues that will be addressed before any attempts are made to communicate with or control the 1970’s-era spacecraft as it nears the Earth in August.

“The intrepid ISEE-3 spacecraft was sent away from its primary mission to study the physics of the solar wind, extending its mission of discovery to study two comets,” said John Grunsfeld, astronaut and associate administrator for the Science Mission Directorate at NASA headquarters in Washington. “We have a chance to engage a new generation of citizen scientists through this creative effort to recapture the ISEE-3 spacecraft as it zips by the Earth this summer.” NASA Signs Agreement with Citizen Scientists Attempting to Communicate with Old Spacecraft

Do you have any thirty-five (35) year old software you would like to start re-using? 😉

What information should you have captured for that software?

The crowdfunding is in “stretch mode,” working towards $150,000. Support at: ISEE-3 Reboot Project by Space College, Skycorp, and SpaceRef.

Data Mining the Internet Archive Collection [Librarians Take Note]

Wednesday, March 12th, 2014

Data Mining the Internet Archive Collection by Caleb McDaniel.

From the “Lesson Goals:”

The collections of the Internet Archive (IA) include many digitized sources of interest to historians, including early JSTOR journal content, John Adams’s personal library, and the Haiti collection at the John Carter Brown Library. In short, to quote Programming Historian Ian Milligan, “The Internet Archive rocks.”

In this lesson, you’ll learn how to download files from such collections using a Python module specifically designed for the Internet Archive. You will also learn how to use another Python module designed for parsing MARC XML records, a widely used standard for formatting bibliographic metadata.

For demonstration purposes, this lesson will focus on working with the digitized version of the Anti-Slavery Collection at the Boston Public Library in Copley Square. We will first download a large collection of MARC records from this collection, and then use Python to retrieve and analyze bibliographic information about items in the collection. For example, by the end of this lesson, you will be able to create a list of every named place from which a letter in the antislavery collection was written, which you could then use for a mapping project or some other kind of analysis.
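The lesson itself uses the `internetarchive` and `pymarc` modules; as a rough sketch of the end goal (pulling place names out of MARC XML), here is a stdlib-only version using `xml.etree.ElementTree`. The sample record and the choice of field 260 subfield `a` (place of publication) are illustrative assumptions, not taken from the lesson.

```python
# Minimal sketch: extracting place names from MARC XML with the standard
# library, in the spirit of the lesson (which uses the pymarc module).
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def places_of_writing(marc_xml):
    """Collect subfield 'a' of datafield 260 (place) from each record."""
    root = ET.fromstring(marc_xml)
    places = []
    for record in root.iter(MARC_NS + "record"):
        for field in record.iter(MARC_NS + "datafield"):
            if field.get("tag") == "260":
                for sub in field.iter(MARC_NS + "subfield"):
                    if sub.get("code") == "a":
                        places.append(sub.text.strip(" :;,"))
    return places

# A made-up record in the shape of the collection's MARC XML
sample = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Boston, [Mass.] :</subfield>
      <subfield code="c">1852.</subfield>
    </datafield>
  </record>
</collection>"""

print(places_of_writing(sample))  # ['Boston, [Mass.]']
```

Run over the full set of downloaded records, a list like this is exactly the starting point for the mapping project the lesson mentions.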

This rocks!

In particular for librarians and library students who will already be familiar with MARC records.

Some 7,000 items from the Boston Public Library’s anti-slavery collection at Copley Square are the focus of this lesson.

That means historians have access to rich metadata, full images, and partial descriptions for thousands of antislavery letters, manuscripts, and publications.

Would original anti-slavery materials, written by actual participants, have interested you as a student? Do you think such materials would interest students now?

I first saw this in a tweet by Gregory Piatetsky.

Cataloguing projects

Tuesday, March 11th, 2014

Cataloguing projects (UK National Archive)

From the webpage:

The National Archives’ Cataloguing Strategy

The overall objective of our cataloguing work is to deliver more comprehensive and searchable catalogues, thus improving access to public records. To make online searches work well we need to provide adequate data and prioritise cataloguing work that tackles less adequate descriptions. For example, we regard ranges of abbreviated names or file numbers as inadequate.

I was led to this delightful resource by a tweet from David Underdown, advising that his presentation from National Catalogue Day in 2013 was now online.

His presentation along with several others and reports about projects in prior years are available at this projects page.

I thought the presentation titled: Opening up of Litigation: 1385-1875 by Amanda Bevan and David Foster, was quite interesting in light of various projects that want to create new “public” citation systems for law and litigation.

I haven’t yet seen such a proposal that gives sufficient consideration to the enormous question of what to do with old legal materials.

The litigation presentation could be a poster child for topic maps.

I am looking forward to reading the other presentations as well.

…Digital Asset Sustainability…

Thursday, January 16th, 2014

A National Agenda Bibliography for Digital Asset Sustainability and Preservation Cost Modeling by Butch Lazorchak.

From the post:

The 2014 National Digital Stewardship Agenda, released in July 2013, is still a must-read (have you read it yet?). It integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development.

The Agenda suggests a number of important research areas for the digital stewardship community to consider, but the need for more coordinated applied research in cost modeling and sustainability is high on the list of areas prime for research and scholarship.

The section in the Agenda on “Applied Research for Cost Modeling and Audit Modeling” suggests some areas for exploration:

“Currently there are limited models for cost estimation for ongoing storage of digital content; cost estimation models need to be robust and flexible. Furthermore, as discussed below…there are virtually no models available to systematically and reliably predict the future value of preserved content. Different approaches to cost estimation should be explored and compared to existing models with emphasis on reproducibility of results. The development of a cost calculator would benefit organizations in making estimates of the long‐term storage costs for their digital content.”

In June of 2012 I put together a bibliography of resources touching on the economic sustainability of digital resources. I’m pleasantly surprised at all the new work that’s been done in the meantime, but as the Agenda suggests, there’s more room for directed research in this area. Or perhaps, as Paul Wheatley suggests in this blog post, what’s really needed are coordinated responses to sustainability challenges that build directly on this rich body of work, and that effectively communicate the results out to a wide audience.

I’ve updated the bibliography, hoping that researchers and funders will explore the existing body of projects, approaches and research, note the gaps in coverage suggested by the Agenda and make efforts to address the gaps in the near future through new research or funding.
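The "cost calculator" the Agenda calls for can be sketched in a few lines. This is a back-of-envelope model only: the fixed yearly decline in price per terabyte (a Kryder-style trend) and the periodic migration charge are my own invented parameters, not anything proposed by the Agenda or the bibliography.

```python
# Back-of-envelope estimator for long-term digital storage costs, assuming
# media prices fall by a fixed fraction each year and content must be
# migrated to fresh media/formats on a fixed cycle. All defaults invented.
def storage_cost(tb, years, price_per_tb=50.0, yearly_decline=0.15,
                 migration_every=5, migration_cost_per_tb=10.0):
    total = 0.0
    for year in range(years):
        total += tb * price_per_tb               # pay this year's storage
        price_per_tb *= (1 - yearly_decline)     # media gets cheaper
        if (year + 1) % migration_every == 0:
            total += tb * migration_cost_per_tb  # periodic migration pass
    return round(total, 2)

# 100 TB held for a decade under these assumptions
print(storage_cost(100, 10))
```

Even a toy model like this makes the Agenda's point concrete: the bill never reaches zero, because falling media prices are offset by recurring migration work.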

I count some seventy-one (71) items in this bibliography.

Digital preservation is an area where topic maps can help maintain access over changing customs and vocabularies, but just like migrating from one form of media to another, it doesn’t happen by itself.

Nor is there any “free lunch” just because the data is culturally important, rare, etc. Someone has to pay the bill for its preservation.

Having the cost of semantic access included in digital preservation would not hurt the cause of topic maps.


Aberdeen – 1398 to Present

Sunday, December 15th, 2013

A Text Analytic Approach to Rural and Urban Legal Histories

From the post:

Aberdeen has the earliest and most complete body of surviving records of any Scottish town, running in near-unbroken sequence from 1398 to the present day. Our central focus is on the ‘provincial town’, especially its articulations and interactions with surrounding rural communities, infrastructure and natural resources. In this multi-disciplinary project, we apply text analytical tools to digitised Aberdeen Burgh Records, which are a UNESCO listed cultural artifact. The meaningful content of the Records is linguistically obscured, so must be interpreted. Moreover, to extract and reuse the content with Semantic Web and linked data technologies, it must be machine readable and richly annotated. To accomplish this, we develop a text analytic tool that specifically relates to the language, content, and structure of the Records. The result is an accessible, flexible, and essential precursor to the development of Semantic Web and linked data applications related to the Records. The applications will exploit the artifact to promote Aberdeen Burgh and Shire cultural tourism, curriculum development, and scholarship.

The scholarly objective of this project is to develop the analytic framework, methods, and resource materials to apply a text analytic tool to annotate and access the content of the Burgh records. Amongst the text analytic issues to address in historical perspective are: the identification and analysis of legal entities, events, and roles; and the analysis of legal argumentation and reasoning. Amongst the legal historical issues are: the political and legal culture and authority in the Burgh and Shire, particularly pertaining to the management and use of natural resources. Having an understanding of these issues and being able to access them using Semantic Web/linked data technologies will then facilitate exploitation in applications.

This project complements a distinct, existing collaboration between the Aberdeen City & Aberdeenshire Archives (ACAA) and the University (Connecting and Projecting Aberdeen’s Burgh Records, jointly led by Andrew Mackillop and Jackson Armstrong) (the RIISS Project), which will both make a contribution to the project (see details on application form). This multi-disciplinary application seeks funding from Dot.Rural chiefly for the time of two specialist researchers: a Research Fellow to interpret the multiple languages, handwriting scripts, archaic conventions, and conceptual categories emerging from these records; and subcontracting the A-I to carry out the text analytic and linked data tasks on a given corpus of previously transcribed council records, taking the RF’s interpretation as input.

Now there’s a project for tracking changing semantics over the hills and valleys of time!

Will be interesting to see how they capture semantics that are alien to our own.

Or how they preserve relationships between ancient semantic concepts.

44 million stars and counting: …

Sunday, December 1st, 2013

44 million stars and counting: Astronomers play Snap and remap the sky

From the post:

Tens of millions of stars and galaxies, among them hundreds of thousands that are unexpectedly fading or brightening, have been catalogued properly for the first time.

Professor Bryan Gaensler, Director of the ARC Centre of Excellence for All-sky Astrophysics (CAASTRO) based in the School of Physics at the University of Sydney, Australia, and Dr Greg Madsen at the University of Cambridge, undertook this formidable challenge by combining photographic and digital data from two major astronomical surveys of the sky, separated by sixty years.

The new precision catalogue has just been published in The Astrophysical Journal Supplement Series. It represents one of the most comprehensive and accurate compilations of stars and galaxies ever produced, covering 35 percent of the sky and using data going back as far as 1949.

Professor Gaensler and Dr Madsen began by re-examining a collection of 7400 old photographic plates, which had previously been combined by the US Naval Observatory into a catalogue of more than one billion stars and galaxies.

The researchers are making their entire catalogue public on the WWW (unlike the paywalled Astrophysical Journal article referenced above), in the lead-up to the next generation of telescopes designed to search for changes in the night sky, such as the Panoramic Survey Telescope and Rapid Response System in Hawaii and the SkyMapper telescope in Australia.

Now there’s a big data project!

Because of the time period for comparison, the investigators found variations in star brightness that would have otherwise gone undetected.
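The core comparison, matching stars by sky position across two epochs and flagging brightness changes, can be sketched naively. The coordinates, match radius, and variability threshold below are invented for illustration; a real pipeline over a billion sources would use spatial indexing, not this quadratic scan.

```python
# Toy sketch of cross-matching two survey epochs: pair stars by position
# (RA/Dec in degrees), then flag any whose magnitude changed markedly.
def cross_match(old_epoch, new_epoch, radius=0.001, min_delta=1.0):
    """Return (ra, dec, old_mag, new_mag) for positionally matched stars
    whose brightness changed by at least min_delta magnitudes."""
    flagged = []
    for ra1, dec1, mag1 in old_epoch:
        for ra2, dec2, mag2 in new_epoch:
            if abs(ra1 - ra2) < radius and abs(dec1 - dec2) < radius:
                if abs(mag1 - mag2) >= min_delta:
                    flagged.append((ra1, dec1, mag1, mag2))
    return flagged

# (ra, dec, magnitude): one star steady, one fading by 1.6 magnitudes
plates_1949 = [(10.6847, 41.2690, 12.1), (83.8221, -5.3911, 8.0)]
survey_2009 = [(10.6848, 41.2691, 12.1), (83.8222, -5.3912, 9.6)]

print(cross_match(plates_1949, survey_2009))  # [(83.8221, -5.3911, 8.0, 9.6)]
```

The sixty-year baseline is what makes the second column of magnitudes informative: slow variables invisible within a single survey stand out across epochs.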

Will your data be usable in sixty (60) years?

The curse of NOARK

Tuesday, November 26th, 2013

The curse of NOARK by Lars Marius Garshol.

From the post:

I’m writing about a phenomenon that’s specifically Norwegian, but some things are easier to explain to foreigners, because we Norwegians have been conditioned to accept them. In this case I’m referring to the state of the art for archiving software in the Norwegian public sector, where everything revolves around the standard known as NOARK.

Let’s start with the beginning. Scandinavian chancelleries have a centuries-long tradition for a specific approach to archiving, which could be described as a kind of correspondence journal. Essentially, all incoming and outgoing mail, as well as important internal documents, were logged in a journal, with title, from, to, and date for each document. In addition, each document would be filed under a “sak”, which translates roughly as “case” or “matter under consideration”. Effectively, it’s a kind of tag which ties together one thread of documents relating to a specific matter.

The classic example is if the government receives a request of some sort, then produces some intermediate documents while processing it, and then sends a response. Perhaps there may even be a couple of rounds of back-and-forth with the external party. This would be an archetypal “sak” (from now on referred to as “case”), and you can see how having all these documents in a single case file would be absolutely necessary for anyone responding to the case. In fact, it’s not dissimilar to the concept of an issue in an issue-tracking system.
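The journal-plus-case structure described above can be sketched as a tiny data model. The class and field names here are my own shorthand, not NOARK terminology.

```python
# Hypothetical sketch of the Scandinavian journal/"sak" model: every
# document is logged with title, from, to, and date, and tagged with a
# case id that threads related documents together.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class JournalEntry:
    title: str
    sender: str
    recipient: str
    logged: date
    case_id: str          # the "sak" tying related documents together

@dataclass
class Case:
    case_id: str
    entries: list = field(default_factory=list)

def file_entry(cases, entry):
    """Log a document and thread it into its case, creating the case if new."""
    case = cases.setdefault(entry.case_id, Case(entry.case_id))
    case.entries.append(entry)
    return case

cases = {}
file_entry(cases, JournalEntry("Request", "Citizen", "Ministry",
                               date(2013, 1, 4), "2013/17"))
file_entry(cases, JournalEntry("Response", "Ministry", "Citizen",
                               date(2013, 2, 1), "2013/17"))
print(len(cases["2013/17"].entries))  # 2
```

The parallel to an issue tracker is direct: the case id plays the role of the issue number, and the journal is the audit log.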

In this post and its continuation in Archive web services: a missed opportunity Lars details the shortcomings of the NOARK standard.

To some degree this is specifically Norwegian, but the problem of poor IT design is truly an international phenomenon.

I haven’t made any suggestions since the U.S. is home to the virtual case management debacle, the incredible melting NSA data center, not to mention the non-functional health care IT system known as HealthCare.gov.

Read these posts by Lars, because you may encounter projects before mistakes similar to the ones he describes have been set in stone.

No guarantees of success but instead of providing semantic data management on top of a broken IT system, you could be providing semantic data management on top of a non-broken IT system.

Perhaps never a great IT system but I would settle for a non-broken one any day.

Hot Topics: The DuraSpace Community Webinar Series

Thursday, November 7th, 2013

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace and Fedora but I wasn’t familiar with the “Hot Topics” webinar series. I was following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats, when I encountered the “Hot Topics” page.

  • Series Six: Research Data in Repositories
  • Series Five: VIVO–Research Discovery and Networking
  • Series Four: Research Data Management Support
  • Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
  • Series Two: Managing and Preserving Audio and Video in your Digital Repository
  • Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But, in ten to fifteen years when GPU techniques are like COBOL is now, good data curation will enable future students to access those techniques.

I think that is worthwhile.


The Shelley-Godwin Archive

Tuesday, November 5th, 2013

The Shelley-Godwin Archive

From the homepage:

The Shelley-Godwin Archive will provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online for the first time ever the widely dispersed handwritten legacy of this uniquely gifted family of writers. The result of a partnership between the New York Public Library and the Maryland Institute for Technology in the Humanities, in cooperation with Oxford’s Bodleian Library, the S-GA also includes key contributions from the Huntington Library, the British Library, and the Houghton Library. In total, these partner libraries contain over 90% of all known relevant manuscripts.

In case you don’t recognize the name, Mary Shelley wrote Frankenstein; or, The Modern Prometheus; William Godwin, philosopher, early modern (unfortunately theoretical) anarchist; Percy Bysshe Shelley, English Romantic poet; Mary Wollstonecraft, writer, feminist. Quite a group for the time or even now.

From the About page on Technological Infrastructure:

The technical infrastructure of the Shelley-Godwin Archive builds on linked data principles and emerging standards such as the Shared Canvas data model and the Text Encoding Initiative’s Genetic Editions vocabulary. It is designed to support a participatory platform where scholars, students, and the general public will be able to engage in the curation and annotation of the Archive’s contents.

The Archive’s transcriptions and software applications and libraries are currently published on GitHub, a popular commercial host for projects that use the Git version control system.

  • TEI transcriptions and other data
  • Shared Canvas viewer and search service
  • Shared Canvas manifest generation

All content and code in these repositories is available under open licenses (the Apache License, Version 2.0 and the Creative Commons Attribution license). Please see the licensing information in each individual repository for additional details.

Shared Canvas and Linked Open Data

Shared Canvas is a new data model designed to facilitate the description and presentation of physical artifacts—usually textual—in the emerging linked open data ecosystem. The model is based on the concept of annotation, which it uses both to associate media files with an abstract canvas representing an artifact, and to enable anyone on the web to describe, discuss, and reuse suitably licensed archival materials and digital facsimile editions. By allowing visitors to create connections to secondary scholarship, social media, or even scenes in movies, projects built on Shared Canvas attempt to break down the walls that have traditionally enclosed digital archives and editions.

Linked open data or content is published and licensed so that “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike,” with the additional requirement that when an entity such as a person, a place, or thing that has a recognizable identity is referenced in the data, the reference is made using a well-known identifier—called a universal resource identifier, or “URI”—that can be shared between projects. Together, the linking and openness allow conformant sets of data to be combined into new data sets that work together, allowing anyone to publish their own data as an augmentation of an existing published data set without requiring extensive reformulation of the information before it can be used by anyone else.

The Shared Canvas data model was developed within the context of the study of medieval manuscripts to provide a way for all of the representations of a manuscript to co-exist in an openly addressable and shareable form. A relatively well-known example of this is the tenth-century Archimedes Palimpsest. Each of the pages in the palimpsest was imaged using a number of different wavelengths of light to bring out different characteristics of the parchment and ink. For example, some inks are visible under one set of wavelengths while other inks are visible under a different set. Because the original writing and the newer writing in the palimpsest used different inks, the images made using different wavelengths allow the scholar to see each ink without having to consciously ignore the other ink. In some cases, the ink has faded so much that it is no longer visible to the naked eye. The Shared Canvas data model brings together all of these different images of a single page by considering each image to be an annotation about the page instead of a surrogate for the page. The Shared Canvas website has a viewer that demonstrates how the imaging wavelengths can be selected for a page.
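The palimpsest example can be made concrete with a small sketch: the canvas stands for the physical page, and each wavelength image is merely an annotation targeting it. The property names and file names below are invented for illustration, not the actual Shared Canvas vocabulary.

```python
# Illustrative sketch of "image as annotation, not surrogate": several
# imaging passes annotate one canvas, and none of them *is* the page.
canvas = {"id": "urn:example:palimpsest/page-57", "height": 4000, "width": 3000}

annotations = [
    {"target": canvas["id"], "body": "page57_visible.tif", "wavelength": "visible"},
    {"target": canvas["id"], "body": "page57_uv.tif", "wavelength": "ultraviolet"},
    {"target": canvas["id"], "body": "page57_raking.tif", "wavelength": "raking-light"},
]

def images_for(canvas_id, wavelength=None):
    """All images annotating a canvas, optionally filtered by imaging band."""
    return [a["body"] for a in annotations
            if a["target"] == canvas_id
            and (wavelength is None or a["wavelength"] == wavelength)]

print(images_for(canvas["id"]))                 # all three imaging passes
print(images_for(canvas["id"], "ultraviolet"))  # ['page57_uv.tif']
```

Because every image points at the canvas rather than replacing it, a scholar's note, a transcription, or a scene in a movie can target the very same canvas id without privileging any one representation.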

One important bit, at least for topic maps, is the view of the Shared Canvas data model that:

each image [is considered] to be an annotation about the page instead of a surrogate for the page.

If I tried to say that or even re-say it, it would be much more obscure. 😉

Whether “annotation about” versus “surrogate for” will catch on beyond manuscript studies is hard to say.

Not the way it is usually said in topic maps but if other terminology is better understood, why not?