Data Mining the Internet Archive Collection by Caleb McDaniel.
From the “Lesson Goals:”
The collections of the Internet Archive (IA) include many digitized sources of interest to historians, including early JSTOR journal content, John Adams’s personal library, and the Haiti collection at the John Carter Brown Library. In short, to quote Programming Historian Ian Milligan, “The Internet Archive rocks.”
In this lesson, you’ll learn how to download files from such collections using a Python module specifically designed for the Internet Archive. You will also learn how to use another Python module designed for parsing MARC XML records, a widely used standard for formatting bibliographic metadata.
For demonstration purposes, this lesson will focus on working with the digitized version of the Anti-Slavery Collection at the Boston Public Library in Copley Square. We will first download a large collection of MARC records from this collection, and then use Python to retrieve and analyze bibliographic information about items in the collection. For example, by the end of this lesson, you will be able to create a list of every named place from which a letter in the antislavery collection was written, which you could then use for a mapping project or some other kind of analysis.
This lesson should be of particular interest to librarians and library students, who will already be familiar with MARC records.
Some 7,000 items from the Boston Public Library’s anti-slavery collection at Copley Square are the focus of this lesson.
That means historians have access to rich metadata, full images, and partial descriptions for thousands of antislavery letters, manuscripts, and publications.
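To give a sense of what working with that metadata looks like, here is a minimal sketch of the "list of every named place" idea from the lesson goals. It is not the lesson's own code: the lesson uses a dedicated MARC-parsing module, while this sketch uses only Python's standard library so it runs without installing anything. The sample records are invented for illustration, and the sketch assumes the place of writing is recorded in MARC field 260, subfield a, which may not hold for every record in the real collection.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by MARC XML ("MARC21 slim") documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample data, made up for this sketch. Real records from the
# collection would be downloaded from the Internet Archive instead.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">[Letter to] Dear Friend</subfield>
    </datafield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Boston, </subfield>
      <subfield code="c">1840.</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">[Philadelphia] :</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Boston, </subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">A record with no place recorded</subfield>
    </datafield>
  </record>
</collection>"""


def places_written(marcxml_text):
    """Return the place named in field 260, subfield a, of each record.

    Assumption: 260$a holds the place of writing. Records without that
    subfield are skipped rather than guessed at.
    """
    root = ET.fromstring(marcxml_text)
    places = []
    for record in root.iterfind("marc:record", MARC_NS):
        subfield = record.find(
            "marc:datafield[@tag='260']/marc:subfield[@code='a']", MARC_NS
        )
        if subfield is not None and subfield.text:
            # Trim the cataloguing punctuation that often surrounds the name.
            places.append(subfield.text.strip(" ,:;[]"))
    return places


def place_counts(marcxml_text):
    """Count how many records name each place."""
    return Counter(places_written(marcxml_text))
```

Calling `place_counts(SAMPLE_MARCXML)` on the invented sample tallies the records by place; a tally like that is the kind of list one could feed into a mapping project.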
Would original anti-slavery materials, written by actual participants, have interested you as a student? Do you think such materials would interest students now?
I first saw this in a tweet by Gregory Piatetsky.