How to Build a TimesMachine by Jane Cotler and Evan Sandhaus.
From the post:
At the beginning of this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002. Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980, which consisted of over 11 million articles spread out over approximately 2.5 million pages. The new expansion adds an additional 8,035 complete issues containing 1.4 million articles over 1.6 million pages.
Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this post we’ll describe how we tackled two. First, we’ll discuss the fundamental challenge with TimesMachine: efficiently providing a user with a scan of an entire day’s newspaper without requiring the download of hundreds of megabytes of data. Then, we’ll discuss a fascinating string matching problem we had to solve in order to include articles published after 1980 in TimesMachine.
…
It doesn’t cover every extant Hebrew Bible witness, in both images and transcription, or every extant cuneiform tablet with its secondary literature, but if you are interested in more recent events, what a magnificent resource!
Tesseract-ocr gets a shout-out and link for its use on the New York Times archives.
The string matching solution, which made it possible to include the post-1980 articles, shows the advantages of settling for a “nearly perfect” solution.
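To give a feel for what a “nearly perfect” match can look like, here is a minimal sketch of approximate string matching using difflib from Python's standard library. This is not the method described in the post. The function name best_match, the 0.6 threshold, and the sample strings are all assumptions made up for illustration; the idea is only that clean article text can be paired with noisy OCR text without demanding an exact match.

```python
from difflib import SequenceMatcher


def best_match(clean_text, ocr_candidates, threshold=0.6):
    """Return (index, score) for the OCR candidate most similar to clean_text.

    Returns None when no candidate reaches the threshold.
    The threshold is an illustrative assumption, not a tuned value.
    """
    best_index, best_score = None, 0.0
    for i, candidate in enumerate(ocr_candidates):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer matching characters.
        score = SequenceMatcher(None, clean_text.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_index, best_score = i, score
    if best_index is not None and best_score >= threshold:
        return best_index, best_score
    return None


# Hypothetical data: one clean sentence and two simulated OCR readings.
clean = "The quick brown fox jumps over the lazy dog"
ocr_blocks = [
    "1234567890 9876543210",
    "Tbe qu1ck brovvn fox jumps ovcr the lazy d0g",
]
print(best_match(clean, ocr_blocks))
```

Accepting a score below 1.0 is the whole point: OCR errors make exact matches rare, so a tolerant comparison recovers pairings that a strict equality test would miss, at the cost of an occasional wrong match that the threshold has to keep in check.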