Crowdsourcing + Machine Learning: Nicholas Woodward at TCDL by Ben W. Brumfield.
I was so impressed by Nicholas Woodward’s presentation at TCDL this year that I asked him if I could share “Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text” on this blog.
Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach for transcribing large scanned document collections that combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. The purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion of potential avenues for future work.
Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882 to 2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters), such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.
While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (see below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.
A post at the intersection of many of my interests!
Imagine pushing this just a tad further to incorporate management of subject identity, whether visible to the user or not.