The ClueWeb12 project reports:
The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in June 2012. Dataset construction consists of crawling the web for about 1 billion pages, web page filtering, and organization into a research-ready dataset.
The crawl was initially seeded with 2,820,500 uniq URLs. This list was generated by taking the 10 million ClueWeb09 urls that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kindom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.
A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, which was downloaded on 2012-02-03. The crawler blackliset consists of urls in the malware, phishing, spyware, virusinfected, filehosting and filesharing categories. Also included in the blacklist is a small number (currently less than a dozen) of sites that opted out of the crawl.
The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.
The crawler captures an average of 10-15 million pages (and associated images, etc) per day. Its progress is documented in a daily progess report.
Are there any search engine ads: X billion of pages crawled?