Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 30, 2011

Common Crawl

Filed under: Common Crawl,Search Algorithms,Search Data,Searching — Patrick Durusau @ 8:13 pm

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.
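For readers unfamiliar with what "supporting robots.txt" involves, here is a minimal sketch in Python of the check a polite crawler makes before fetching a page. It is not Common Crawl's code. The sample robots.txt content, the example.com URLs, and the helper name `is_allowed` are all made up for illustration, and the user-agent token is assumed from the "ccBot" name in the quote above. Only the standard library's `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs; the user-agent token is an assumption based on the quote.
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/page.html",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "ccBot", url))
```

A real crawler would download each site's own robots.txt (for example with `RobotFileParser.set_url` and `read`) and cache the result instead of parsing a hard-coded string.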

The current crawl is reported to be 5 billion pages. That should keep your hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching, as well as about processing seriously large data sets.
