Extracting images from scanned book pages by Chris Adams.
From the post:
I work on a project which has placed a number of books online. Over the years we’ve improved server performance and worked on a fast, responsive viewer for scanned books to make our books as accessible as possible but it’s still challenging to help visitors find something of interest out of hundreds of thousands of scanned pages.
Trevor and I have discussed various ways to improve the situation and one idea which seemed promising was seeing how hard it would be to extract the images from digitized pages so we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception and since it kept coming up at work I decided to see how far I could get using OpenCV.
Everything you see below is open-source and comments are highly welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters) so feel free to fork it or open issues.
Just in case you are looking for a Fall project. 😉
Consider capturing the images and their contents in associations with authors, publishers, etc. To enable mining those associations for patterns.