Wikipedia Usage Statistics by Paul Houle.
From the post:
The Wikimedia Foundation publishes page view statistics for Wikimedia projects here; this server is rate-limited so it took roughly a month to transfer this 4 TB data set into S3 Storage in the AWS cloud. The photo on the left is of a hard drive containing a copy of the data that was produced with AWS Import/Export.
Once in S3, it is easy to process this data with Amazon Map/Reduce using the Open Source telepath software.
The first product developed from this is SubjectiveEye3D.
It’s your turn
Future projects require that this data be integrated with semantic data from :BaseKB and that has me working on tools such as RDFeasy. In the meantime, a mirror of the Wikipedia pagecounts from Jan 2008 to Feb 2014 is available in a requester pays bucket in S3, which means you can use it in the Amazon Cloud for free and download data elsewhere for the cost of bulk network transfer.
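To give a rough idea of what processing this data involves, here is a minimal sketch in Python. The post does not describe the file layout, so the sketch assumes the four space-separated fields of the Wikimedia pagecounts dumps (project, page title, view count, bytes transferred). The function names and the sample records are made up for illustration; this is not the code from the post.

```python
from collections import Counter


def parse_line(line):
    """Split one record into (project, title, views); return None if malformed.

    Assumed layout: "<project> <title> <views> <bytes>", space-separated.
    """
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _bytes = parts
    try:
        return project, title, int(views)
    except ValueError:
        return None


def total_views(lines):
    """Sum view counts per (project, title), skipping malformed records."""
    totals = Counter()
    for line in lines:
        record = parse_line(line.strip())
        if record is not None:
            project, title, views = record
            totals[(project, title)] += views
    return totals


# Invented sample records, not real statistics.
sample = [
    "en Main_Page 10 2000",
    "en Main_Page 5 1000",
    "de Hauptseite 3 600",
    "garbage line",
]
totals = total_views(sample)
for (project, title), views in totals.most_common():
    print(project, title, views)
```

The parse step plays the role of a mapper and the per-key sum the role of a reducer, so the same logic could in principle be split across a Map/Reduce job over the hourly files.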
Interesting, isn’t it?
That “open” data can be so difficult to obtain and manipulate that it may as well not be “open” at all for the average user.
Something to keep in mind when big players talk about privacy. Do they mean private from their prying eyes or yours?
I think you will find in most cases that “privacy” means private from you and not the big players.
If you want to do a good deed for this week, support this data set at Gittip.
I first saw this in a tweet by Gregory Piatetsky.