How to analyze 100 million images for $624 by Pete Warden.
From the post:
Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.
Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.
I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)
There are a couple of other components that are necessary to reach the same results as Pete.
Seek HIPI for processing photos on Hadoop and OpenCV and the rest of Pete’s article for some very helpful tips.