We Just Ran Twenty-Three Million Queries of the World Bank’s Website – Working Paper 362 by Sarah Dykstra, Benjamin Dykstra, and Justin Sandefur.
Abstract:
Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank’s PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank’s web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at www.cgdev.org/povcalnet.
That’s what I would call large scale web scraping!
This is a useful model to follow for many sources, such as the U.S. Department of Agriculture: a gold mine of reports, data, and statistics, but all broken up for the manual act of reading. Or at least that is a charitable explanation for their current data organization.
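The abstract mentions that points on the cumulative distribution make it easy to compute poverty rates and growth incidence curves. As a rough illustration of why, here is a minimal Python sketch. It is not the authors' code, and it does not use the format of the downloadable data set: the function names, the array layout (income thresholds paired with cumulative population shares), and the two synthetic "surveys" are all assumptions made up for the example.

```python
import numpy as np

def headcount_ratio(incomes, cum_shares, poverty_line):
    """Share of the population below poverty_line, read off the CDF
    by linear interpolation between the known points."""
    return float(np.interp(poverty_line, incomes, cum_shares))

def quantile_income(incomes, cum_shares, p):
    """Income at population share(s) p, found by inverting the CDF.
    Assumes cum_shares is strictly increasing."""
    return np.interp(p, cum_shares, incomes)

def growth_incidence_curve(cdf_start, cdf_end, years, percentiles):
    """Annualized income growth at each percentile between two surveys.
    Each cdf_* argument is a hypothetical (incomes, cum_shares) pair."""
    q_start = quantile_income(cdf_start[0], cdf_start[1], percentiles)
    q_end = quantile_income(cdf_end[0], cdf_end[1], percentiles)
    return (q_end / q_start) ** (1.0 / years) - 1.0

# Synthetic data, invented for illustration only: incomes are uniform on
# [0, 10] in the first survey and every income has doubled five years later.
shares = np.linspace(0.0, 1.0, 101)
incomes_first = np.linspace(0.0, 10.0, 101)
incomes_second = 2.0 * incomes_first

# Headcount at a poverty line of 1.25 in the first survey (about 0.125 here).
poverty_rate = headcount_ratio(incomes_first, shares, 1.25)

# Growth incidence curve at the 10th through 90th percentiles.
percentiles = np.linspace(0.1, 0.9, 9)
gic = growth_incidence_curve(
    (incomes_first, shares), (incomes_second, shares), 5, percentiles
)

print(poverty_rate)
print(gic)
```

With 10,000 points per survey, linear interpolation between neighboring points is probably accurate enough for most purposes. Changing the poverty line or the PPP conversion then only means evaluating the same curve at a different threshold, with no further queries to the website.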