Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!
The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket
aws-publicdatasets under the key prefix
Crawl Date: January-June 2012
Total Data: 40.1 Terabyte (compressed)
Parsed HTML URLs: 3,005,629,093
URLs with Triples: 369,254,196
Domains in Crawl: 40,600,000
Domains with Triples: 2,286,277
Typed Entities: 1,811,471,956
Triples: 7,350,953,995
The authors report:
Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.
PLDs = Pay-Level-Domains.
The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.
Or to put it differently, if RDFa is used on only 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?
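For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the shares from the figures given above. The variable names and the pct helper are mine, not the authors'. The site counts are the rounded values from the quoted passage ("519 thousand", "140 thousand", "1.7 million", "40.5 million"), so the results are only as precise as that rounding, and the ratio assumes the RDFa and Microdata site counts can simply be added.

```python
# Figures from the statistics and the quoted passage above.
pages_total = 3_005_629_093        # Parsed HTML URLs
pages_with_triples = 369_254_196   # URLs with Triples
domains_with_triples = 2_286_277   # Domains with Triples
plds_total = 40_500_000            # "40.5 million websites (PLDs)" as quoted
rdfa_sites = 519_000               # "approximately 519 thousand"
microdata_sites = 140_000          # "only 140 thousand"
microformat_sites = 1_700_000      # "1.7 million"


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


page_share = pct(pages_with_triples, pages_total)
domain_share = pct(domains_with_triples, plds_total)
rdfa_share = pct(rdfa_sites, plds_total)

# Assumption: no overlap between RDFa and Microdata sites, so the
# two counts are summed directly.
microformat_ratio = round(microformat_sites / (rdfa_sites + microdata_sites), 2)

print("pages with structured data (%):", page_share)
print("PLDs with structured data (%):", domain_share)
print("PLDs using RDFa (%):", rdfa_share)
print("Microformats vs RDFa + Microdata:", microformat_ratio)
```

The page, domain, and RDFa shares agree with the 12.3%, 5.65%, and 1.28% figures after rounding. With these rounded inputs the Microformats ratio comes out slightly above the quoted "approximately 2.5"; I can't tell from the post alone whether that is down to rounding or to how the authors counted sites that use more than one format.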
I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.