Web Data Commons: Extracting Structured Data from the Common Web Crawl
From the post:
Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).
We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.
An interesting mining of open data.
The ability to perform comparisons on data over time is particularly interesting.