Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 21, 2013

A Look Inside Our 210TB 2012 Web Crawl

Filed under: Common Crawl,Search Data — Patrick Durusau @ 5:13 pm

A Look Inside Our 210TB 2012 Web Crawl by Lisa Green.

From the post:

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian’s paper here. If you want to dive deeper you can find the non-aggregated data at s3://aws-publicdatasets/common-crawl/index2012 and the code on GitHub.
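To get a feel for the scale of the figures quoted from the paper, here is a rough back-of-the-envelope sketch in Python. The input numbers come straight from the conclusion above; the function name `corpus_estimates` and the choice of decimal terabytes (10^12 bytes) are my own assumptions, and the averages it produces are only ballpark values, not statistics reported in the paper.

```python
# Figures quoted in the paper's conclusion.
RAW_BYTES = 210 * 10**12   # 210 terabytes, assuming decimal TB (an assumption)
DOCUMENTS = 3.83e9         # 3.83 billion documents
DOMAINS = 41.4e6           # 41.4 million distinct second-level domains
HTML_SHARE = 0.92          # 92% HTML documents
PDF_SHARE = 0.024          # 2.4% PDF files


def corpus_estimates():
    """Derive rough averages and counts from the quoted totals (hypothetical helper)."""
    return {
        "avg_bytes_per_doc": RAW_BYTES / DOCUMENTS,
        "avg_docs_per_domain": DOCUMENTS / DOMAINS,
        "html_docs": DOCUMENTS * HTML_SHARE,
        "pdf_docs": DOCUMENTS * PDF_SHARE,
    }


if __name__ == "__main__":
    for name, value in corpus_estimates().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the average document works out to roughly 55 KB of raw data, with about 92 documents per second-level domain, around 3.5 billion HTML documents and roughly 92 million PDF files.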

Don’t have your own server farm crawling the internet?

Take a long look at CommonCrawl and their publicly accessible crawl data.

If the enterprise search bar is at 9%, the Internet search bar is even lower.

Use CommonCrawl data as a practice field.

Do your first ten “hits” include old data because it is popular?
