Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 2, 2016

How to Build a TimesMachine [New York Times from 1851-2002]

Filed under: History,News,Search Engines — Patrick Durusau @ 1:59 pm

How to Build a TimesMachine by Jane Cotler and Evan Sandhaus.

From the post:

At the beginning of this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002. Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980, which consisted of over 11 million articles spread out over approximately 2.5 million pages. The new expansion adds an additional 8,035 complete issues containing 1.4 million articles over 1.6 million pages.


Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this post we’ll describe how we tackled two. First, we’ll discuss the fundamental challenge with TimesMachine: efficiently providing a user with a scan of an entire day’s newspaper without requiring the download of hundreds of megabytes of data. Then, we’ll discuss a fascinating string matching problem we had to solve in order to include articles published after 1980 in TimesMachine.

It’s not all the extant Hebrew Bible witnesses, both images and transcription, or all extant cuneiform tablets with existing secondary literature, but if you are interested in more recent events, what a magnificent resource!

Tesseract-ocr gets a shout-out and link for its use on the New York Times archives.

The string matching solution for search shows the advantages of finding a “nearly perfect” solution.
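To make "nearly perfect" concrete, here is a minimal sketch of approximate matching between noisy OCR text and clean article text. It is my own illustration, not the method from the Times post: it compares sets of three-word shingles with Jaccard similarity, and the function names, the shingle size, and the 0.3 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than k words yields one short shingle, or nothing at all.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(ocr_text, candidates, threshold=0.3):
    """Pick the candidate whose shingles overlap most with the OCR text.

    candidates maps a hypothetical article id to its clean text.
    Returns (article_id, score), or (None, score) if nothing clears
    the (assumed) threshold.
    """
    ocr = shingles(ocr_text)
    best_key, best_score = None, 0.0
    for key, text in candidates.items():
        score = jaccard(ocr, shingles(text))
        if score > best_score:
            best_key, best_score = key, score
    if best_score < threshold:
        return None, best_score
    return best_key, best_score


if __name__ == "__main__":
    # Made-up sample data: two clean "articles" and one OCR line with two misreads.
    articles = {
        "a": "The city council voted on Tuesday to approve the new budget",
        "b": "Heavy rain flooded the subway tunnels across lower Manhattan",
    }
    noisy = "The city counci1 voted on Tuesday to approve the new budqet"
    print(best_match(noisy, articles))
```

Two OCR misreads are enough to defeat an exact comparison, but most shingles survive, so the right article still wins. Settling for a score above a threshold rather than demanding equality is the trade a "nearly perfect" approach makes.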
