Archive for the ‘Common Crawl’ Category

Crawling the WWW – A $64 Question

Saturday, January 24th, 2015

Have you ever wanted to crawl the WWW? To make a really comprehensive search? Waiting for a private power facility and server farm? You need wait no longer!

Ross Fairbanks, in WikiReverse data pipeline details, describes the creation of WikiReverse:

WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites.

You can browse the data at WikiReverse or download it from S3 as a torrent.

The first thought that struck me was the data set would be useful for deciding which Wikipedia links are the default subject identifiers for particular subjects.

My second thought was what a wonderful starting place to find links with similar content strings, for the creation of topics with multiple subject identifiers.

My third thought was, $64 to search a CommonCrawl data set!

You can do a lot of searches at $64 per before you get to the cost of a server farm, much less a server farm plus a private power facility.

True, it won’t be interactive but then few searches at the NSA are probably interactive. 😉

The true upside is that you are freed from the tyranny of page-rank and hidden algorithms by which vendors attempt to guess what is best for them and secondarily, what is best for you.

Take the time to work through Ross’ post and develop your skills with the CommonCrawl data.

Wouldn’t it be fun to build your own Google?

Thursday, December 11th, 2014

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such as Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding the technological issues would be many, it is contended that the payoff would be immense, but in ways we won’t know until it is available. I suspect Martin is right but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies the problems of extracting value; it doesn’t make them easier to solve.

On the whole I think the idea of a distributed crawl of the web is a great idea, but while that develops, we best hone our skills at extracting value from the partial crawls that already exist.

October 2014 Crawl Archive Available

Friday, November 21st, 2014

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.
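
A minimal sketch of the prefixing step described in the quote, assuming a gzipped listing with one relative path per line. The file name warc.paths.gz and the sample path inside it are made up so the example runs without touching S3; only the s3://aws-publicdatasets/ prefix comes from the post.

```python
import gzip
import os
import tempfile

S3_PREFIX = "s3://aws-publicdatasets/"  # prefix named in the post


def full_paths(listing_gz, prefix=S3_PREFIX):
    """Yield prefix + path for every non-empty line of a gzipped listing."""
    with gzip.open(listing_gz, "rt", encoding="utf-8") as handle:
        for line in handle:
            path = line.strip()
            if path:
                yield prefix + path


# Build a tiny stand-in listing (hypothetical file name and path).
demo_dir = tempfile.mkdtemp()
demo_listing = os.path.join(demo_dir, "warc.paths.gz")
with gzip.open(demo_listing, "wt", encoding="utf-8") as out:
    out.write("common-crawl/crawl-data/CC-MAIN-2014-42/example-segment/file.warc.gz\n")

demo_result = list(full_paths(demo_listing))
print(demo_result[0])
```

Passing a different prefix argument gives the HTTP form of each path in the same way.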

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration! 😉


August 2014 Crawl Data Available

Monday, October 20th, 2014

August 2014 Crawl Data Available by Stephen Merity.

From the post:

The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Have you considered diffing the same webpages from different crawls?

Just curious. It could be empirical evidence of which websites are stable and those where content could change from under you.
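
One way such a diff might look, as a rough sketch using only Python's standard difflib. The two page texts are invented stand-ins for the extracted text of the same URL in two crawls; pulling the real text out of the crawl files is left aside here.

```python
import difflib


def change_ratio(old_text, new_text):
    """Return 0.0 for identical texts, rising toward 1.0 as they diverge."""
    matcher = difflib.SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    )
    return 1.0 - matcher.ratio()


def line_diff(old_text, new_text):
    """Return a unified diff of the two texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="older-crawl",
            tofile="newer-crawl",
            lineterm="",
        )
    )


# Invented sample captures of one page from two crawls.
august = "Welcome\nOpening hours: 9-5\nContact us"
october = "Welcome\nOpening hours: 10-4\nContact us"

print(round(change_ratio(august, october), 3))
for row in line_diff(august, october):
    print(row)
```

Averaging the change ratio per site over many pages would give a crude stability score, though what counts as "stable" is a judgment call.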

Web Data Commons Extraction Framework …

Sunday, August 31st, 2014

Web Data Commons Extraction Framework for the Distributed Processing of CC Data by Robert Meusel.

Interested in a framework to process all the Common Crawl data?

From the post:

We used the extraction tool for example to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$ 500. The extracted graph had a size of less than 100GB zipped.

NSA level processing it’s not but then you are most likely looking for useful results, not data for the sake of filling up drives.

July 2014 Crawl Data Available [Honeypot Detection]

Wednesday, August 13th, 2014

July 2014 Crawl Data Available by Stephen Merity.

From the post:

The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 4.05 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-23/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.

We’ve also released a Python library, gzipstream, that should enable easier access and processing of the Common Crawl dataset. We’d love for you to try it out!

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in case you have exhausted all the possibilities with the April Crawl Data. 😉

Comparing the two crawls:

April – 183TB in size containing approximately 2.6 billion webpages

July – 266TB in size containing approximately 4.05 billion webpages

Just me but I would say there is new material in the July crawl.
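
For the curious, the size of the jump can be worked out from the two announcements. This is back-of-the-envelope arithmetic only: the posts give the sizes as "over" and the page counts as "approximately", so the percentages are rough.

```python
def growth_percent(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0


# Figures as quoted in the April and July announcements.
april = {"size_tb": 183, "pages_billion": 2.6}
july = {"size_tb": 266, "pages_billion": 4.05}

size_growth = growth_percent(april["size_tb"], july["size_tb"])
page_growth = growth_percent(april["pages_billion"], july["pages_billion"])

print("size grew by about %.1f%%" % size_growth)
print("page count grew by about %.1f%%" % page_growth)
```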

The additional content could be CIA, FBI, NSA honeypots or broken firewalls but I rather doubt it.

Curious, how would you detect a honeypot from crawl data? Thinking a daily honeypot report could be a viable product for some market segment.

April 2014 Crawl Data Available

Thursday, July 17th, 2014

April 2014 Crawl Data Available by Stephen Merity.

From the post:

The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Well, at 183TB, I don’t guess I am going to have a local copy. 😉


Navigating the WARC File Format

Friday, April 11th, 2014

Navigating the WARC File Format by Stephen Merity.

From the post:

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.
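
The post's own examples are in Java on Hadoop. As a rough illustration of the record layout only (a version line, named header fields, a blank line, then Content-Length bytes of payload), here is a small Python sketch. The sample record is invented, the reader handles uncompressed bytes only, and it is no substitute for a real WARC library.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % version)
        headers = {}
        while True:
            line = stream.readline().strip()
            if not line:
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.decode("utf-8").strip()] = value.decode("utf-8").strip()
        # The payload length is declared in the record's own headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Invented sample record, shaped like a plaintext (WET-style) entry.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

The published files are gzipped, so in practice the stream would have to be decompressed first.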

If you aren’t already using Common Crawl data, you should be.

Fresh Data Available:

The latest dataset is from March 2014, contains approximately 2.8 billion webpages and is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-10.

What are you going to look for in 2.8 billion webpages?

Winter 2013 Crawl Data Now Available

Saturday, January 11th, 2014

Winter 2013 Crawl Data Now Available by Lisa Green.

From the post:

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2013-48/.

In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.

Data to play with now and the promise of more to come! Can’t argue with that!

Learning more about Common Crawl’s use of Nutch will be fun as well.

Small Crawl

Tuesday, January 7th, 2014

meanpath Jan 2014 Torrent – 1.6TB of crawl data from 115m websites

From the post:

October 2012 was the official kick off date for development of meanpath – our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google level) financial investment. Outside of many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting block for our crawler. Enter Common Crawl which is an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.

We are firm supporters of open access to information which is why we have chosen to release a free crawl of over 115 million sites. This index contains only the front page HTML, robots.txt, favicons, and server headers of every crawlable .com, .net, .org, .biz, .info, .us, .mobi, and .xxx that were in the 2nd of January 2014 zone file. It does not execute or follow JavaScript or CSS so is not 100% equivalent to what you see when you click on view source in your browser. The crawl itself started at 2:00am UTC 4th of January 2014 and finished the same day.

Get Started:
You can access the meanpath January 2014 Front Page Index in two ways:

  1. Bittorrent – We have set up a number of seeds that you can download from using this descriptor. Please seed if you can afford the bandwidth and make sure you have 1.6TB of disk space free if you plan on downloading the whole crawl.
  2. Web front end – If you are not interested in grappling with the raw crawl files you can use our web front end to do some sample searches.

Data Set Statistics:

  1. 149,369,860 seed domains. We started our crawl with a full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682) and .xxx (111,809) top level domains (TLD) for a total of 149,369,860 domains. We have a much larger set of domains that cover all TLDs but very few allow you to download a zone file from the registrar so we cannot guarantee 100% coverage. For statistical purposes having a defined 100% starting point is necessary.
  2. 115,642,924 successfully crawled domains. Of the 149,369,860 domains only 115,642,924 were able to be crawled which is a coverage rate of 77.42%
  3. 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl which was done in 5 passes. If a domain could not be crawled in the first pass we tried 4 more passes before giving up (those excluded by robots.txt are not retried). The most common reason domains are not able to be crawled is a lack of any valid A record for or
  4. 1,500GB of uncompressed data. This has been compressed down to 352.40GB using gzip for ease of download.
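
The statistics above are easy to sanity-check. A short sketch that re-adds the per-TLD seed counts and recomputes the coverage rate and compression ratio, using only the figures quoted in the post:

```python
# Seed domain counts per TLD, as quoted in the meanpath post.
seed_domains = {
    ".com": 112_117_307,
    ".net": 15_226_877,
    ".org": 10_396_351,
    ".info": 5_884_505,
    ".us": 1_804_653,
    ".biz": 2_630_676,
    ".mobi": 1_197_682,
    ".xxx": 111_809,
}

total_seeds = sum(seed_domains.values())
crawled = 115_642_924

coverage = crawled / total_seeds * 100.0   # percent of seeds crawled
compression_ratio = 1500 / 352.40          # uncompressed GB / gzipped GB

print("total seed domains:", total_seeds)
print("coverage: %.2f%%" % coverage)
print("gzip ratio: about %.1f to 1" % compression_ratio)
```

The per-TLD counts do add up to the stated 149,369,860, and the coverage comes out at the quoted 77.42%.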

I just scanned the Net for 2TB hard drives and the average runs between $80 and $100. There doesn’t seem to be much difference between internal and external.

The only issue I foresee is that some ISPs limit downloads. You can always tunnel to another box using SSH but that requires enough storage on the other box as well.

Be sure to check out meanpath’s search capabilities.

Perhaps the day of boutique search engines is getting closer!

2013 Arrives! (New Crawl Data)

Thursday, November 28th, 2013

New Crawl Data Available! by Jordan Mendelson.

From the post:

We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group.

Format Changes

We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. There are also many good open source tools for working with WARC files.

We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.

We have switched our text file format from Hadoop sequence files to WET files (WARC Encapsulated Text) that properly reference the original requests. This makes it far easier for your processes to disambiguate which text extracts belong to which specific page fetches.

Jordan continues to outline the directory structure of the 2013 crawl data and lists additional resources that will be of interest.

If you aren’t Google or some reasonable facsimile thereof (yet), the Common Crawl data set is your doorway into the wild wild content of the WWW.

How do your algorithms fare when matched against the full range of human expression?

Hyperlink Graph

Saturday, November 16th, 2013

Web Data Commons – Hyperlink Graph by Robert Meusel, Oliver Lehmberg and Christian Bizer.

From the post:

This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who develop

  • search algorithms that rank results based on the hyperlinks between pages.
  • SPAM detection methods which identify networks of web pages that are published in order to trick search engines.
  • graph analysis algorithms and can use the hyperlink graph for testing the scalability and performance of their tools.
  • Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

This is great news!

Competing graph engines won’t need to create synthetic data to gauge their scalability/performance.

Looking forward to news of results measured against this data set.

Kudos to the Web Data Commons and Robert Meusel, Oliver Lehmberg and Christian Bizer.

A Look Inside Our 210TB 2012 Web Crawl

Wednesday, August 21st, 2013

A Look Inside Our 210TB 2012 Web Crawl by Lisa Green.

From the post:

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account for more than 55% of the corpus. The corpus contains a large amount of sites from, blog publishing services like and as well as online shopping sites such as These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian’s paper here. If you want to dive deeper you can find the non-aggregated data at s3://aws-publicdatasets/common-crawl/index2012 and the code on GitHub.

Don’t have your own server farm crawling the internet?

Take a long look at CommonCrawl and their publicly accessible crawl data.

If the enterprise search bar is at 9%, the Internet search bar is even lower.

Use CommonCrawl data as a practice field.

Do your first ten “hits” include old data because it is popular?

Blogging about Bloggers

Wednesday, May 1st, 2013

Extracting topics of interests using blog data in Amazon’s Common-Crawl corpus

From the post:

This project aims at profiling blogger interests correlated with their demographics. Amazon’s Common-Crawl corpus was used for this purpose. The crawled data corresponding to the blogger profile web-pages (Sample page) was used as the dataset for this analysis.

The selective download of the required dataset was made possible by the Common Crawl URL Index by Scott Robertson. About 8000 blogger profile pages (surprisingly low!) were found in the corpus using the URL index. Part of the reason for this low number is that the URL index at this time has been generated only for the half of 81TB amazon corpus.

Check out the project at GitHub

If you don’t know already, Common Crawl has a URL index to ease your use of the data set.

URL Search Tool!

Wednesday, March 6th, 2013

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs since you can now run them only on the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!

…NCSU Library URLs in the Common Crawl Index

Sunday, January 20th, 2013

Analysis of the NCSU Library URLs in the Common Crawl Index by Lisa Green.

From the post:

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl Index to look at NCSU Library URLs in the Common Crawl Index. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

A great starting point for using the Common Crawl Index!

Common Crawl URL Index

Thursday, January 10th, 2013

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
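
To see why an index by URL, domain, subdomain or TLD pays off, here is a toy sketch of prefix lookup over sorted keys. The key layout (host reversed, then path) and the helper names are my own assumptions for illustration; this is not the actual format or API of the Common Crawl URL Index.

```python
import bisect
from urllib.parse import urlsplit


def index_key(url):
    """Turn http://en.wikipedia.org/wiki/X into 'org.wikipedia.en/wiki/X'."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")


def prefix_range(sorted_keys, prefix):
    """Return every key starting with prefix, found by binary search."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Appending the highest code point gives an upper bound for the prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
    return sorted_keys[lo:hi]


# Invented sample URLs standing in for index entries.
urls = [
    "http://en.wikipedia.org/wiki/Topic_map",
    "http://example.com/about",
    "http://de.wikipedia.org/wiki/Topic_Maps",
    "http://wikipedia.com/",
]
keys = sorted(index_key(u) for u in urls)

wikipedia_pages = prefix_range(keys, "org.wikipedia.")
print(wikipedia_pages)
```

Reversing the host puts every page of a domain, and every domain of a TLD, next to each other in sort order, so one prefix covers each of those cases.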

What research project would you want to do first?

blekko donates search data to Common Crawl [uncommon knowledge graphs]

Monday, December 17th, 2012

blekko donates search data to Common Crawl by Lisa Green.

From the post:

I am very excited to announce that blekko is donating search data to Common Crawl!

blekko was founded in 2007 to pursue innovations that would eliminate spam in search results. blekko has created a new type of search experience that enlists human editors in its efforts to eliminate spam and personalize search. blekko has raised $55 million in VC and currently has 48 employees, including former Google and Yahoo! Search engineers.

For details of their donation and collaboration with Common Crawl see the post from their blog below. Follow blekko on Twitter and subscribe to their blog to keep abreast of their news (lots of cool stuff going on over there!) and be sure to check out their search.

And from blekko:

At blekko, we believe the web and search should be open and transparent — it’s number one in the blekko Bill of Rights. To make web data accessible, blekko gives away our search results to innovative applications using our API. Today, we’re happy to announce the ongoing donation of our search engine ranking metadata for 140 million websites and 22 billion webpages to the Common Crawl Foundation.

That’s a fair sized chunk of metadata.

The advantage of having large scale crawling and storage capabilities is slowly fading.

Are you ready to take the next step beyond tweaking the same approach?

Yes, Google has the Knowledge Graph. Which is no mean achievement.

On the other hand, aren’t most enterprises interested in uncommon knowledge graphs? As in their knowledge graph?

The difference between a freebie calculator and a Sun workstation.

Which one do you want?

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Friday, December 14th, 2012

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!


The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date             January-June 2012
Total Data             40.1 Terabyte (compressed)
Parsed HTML URLs       3,005,629,093
URLs with Triples      369,254,196
Domains in Crawl       40,600,000
Domains with Triples   2,286,277
Typed Entities         1,811,471,956
Triples                7,350,953,995

See also:

Web Data Commons Extraction Report – August 2012 Corpus


Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

Where the authors report:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.

PLDs = Pay-Level-Domains.

The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.

Or to put it differently, if RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?
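
The shares quoted above can be re-derived from the rounded figures in the authors' summary. A quick sketch, with the caveat that the inputs are the rounded numbers from the quoted paragraph, not the exact table values:

```python
# Rounded figures from the quoted summary paragraph.
total_sites = 40_500_000
sites_with_structured_data = 2_290_000
rdfa_sites = 519_000
microdata_sites = 140_000
microformat_sites = 1_700_000


def percent(part, whole):
    """Share of whole taken by part, in percent."""
    return part / whole * 100.0


rdfa_share = percent(rdfa_sites, total_sites)
structured_share = percent(sites_with_structured_data, total_sites)
microformat_ratio = microformat_sites / (rdfa_sites + microdata_sites)

print("RDFa share of websites: %.2f%%" % rdfa_share)
print("websites with structured data: %.2f%%" % structured_share)
print("Microformats vs RDFa + Microdata: %.2fx" % microformat_ratio)
```

On these rounded inputs the Microformats ratio works out a little above the "approximately 2.5 times" in the quote.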

I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.

Towards Social Discovery…

Sunday, November 4th, 2012

Towards Social Discovery – New Content Models; New Data; New Toolsets by Matthew Berk, Founder of Lucky Oyster.

From the post:

When I first came across the field of information retrieval in the 80’s and early 90’s (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90’s and with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and Page Rank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.

Today we’re at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:

No, I won’t even summarize his three points. It’s short and quite well written.

Read his post and then consider: Where do topic maps fit into his “crossroads?”

A Common Crawl Experiment

Saturday, October 27th, 2012

A Common Crawl Experiment by Pavel Repin.

An introduction to the Common Crawl project.

Starts you off slow, with 4 billion pages. 😉

You are limited only by your imagination.

I first saw this at: Pete Warden’s Five Short Links.

finding names in common crawl

Sunday, August 19th, 2012

finding names in common crawl by Mat Kelcey.

From the post:

the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!

If you don’t know “common crawl,” now would be a good time to meet the project.

From their webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

Mat gets you started by looking for names in the common crawl data set.

Learn Hadoop and get a paper published

Thursday, May 10th, 2012

Learn Hadoop and get a paper published by Allison Domicone.

From the post:

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide citable papers for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.

How very cool!

Hurry, there are nineteen (19) comments already!

A Twelve Step Program for Searching the Internet

Sunday, March 25th, 2012

OK, the real title is: Twelve steps to running your Ruby code across five billion web pages

From the post:

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

A call to action and an awesome post!

If you have ever forwarded a blog post, forward this one.

This would make a great short course topic. Will have to give that some thought.

Web Data Commons

Thursday, March 22nd, 2012

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing: it frees most of us from crawling the web and from extracting structured data from it. Or at least it gives us the basis for more pointed crawling of the web.
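To give a feel for what working with the extracted RDF-quads might look like, here is a minimal sketch in Python. It assumes the download is a plain-text file with one quad per line (subject, predicate, object, graph) and that the fourth term identifies the page the data came from; both are my assumptions, not something the project page spells out. The function names and the example.org sample lines are made up for illustration, and the regular expression is a simplification rather than a full parser.

```python
import re
from collections import Counter

# Simplified term pattern (an assumption, not a complete N-Quads grammar):
# an IRI in angle brackets, a blank node, or a quoted literal with an
# optional language tag or datatype.
TERM = re.compile(
    r'<[^>]*>|_:[^\s]+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
)


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line is not a quad."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and comments
    terms = TERM.findall(line)
    if len(terms) != 4:
        return None  # not a four-term statement, ignore it
    return tuple(terms)


def quads_per_page(lines):
    """Count statements per graph term, assumed here to be the source page."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[3]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real extracted data.
    sample = [
        '<http://example.org/p1#item> <http://schema.org/name> "Widget"@en <http://example.org/p1> .',
        '<http://example.org/p1#item> <http://schema.org/price> "9.99" <http://example.org/p1> .',
        '_:b0 <http://schema.org/name> "Acme" <http://example.org/p2> .',
    ]
    for page, count in quads_per_page(sample).items():
        print(page, count)
```

Counting per page is only one possible cut; the same loop could just as easily filter on a predicate of interest.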

A very welcome development!

Common Crawl To Add New Data In Amazon Web Services Bucket

Tuesday, March 13th, 2012

Common Crawl To Add New Data In Amazon Web Services Bucket

From the post:

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

That’s good news!

At least I think so.

I am sure that, like everyone else, I will be trying to find the cycles (or at least thinking about it) to play with (sorry, explore) the Common Crawl data set.

I hesitate to say without reservation this is a good thing because my data needs are more modest than searching the entire WWW.

That wasn’t so hard to say. Hurt a little but not that much. 😉

I am exploring how to get better focus on information resources of interest to me. I rather doubt that focus is going to start with the entire WWW as an information space. Will keep you posted.

running mahout collocations over common crawl text

Tuesday, March 6th, 2012

running mahout collocations over common crawl text by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)
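If you want a feel for the idea before reading Mat's code, here is a rough sketch in Python of the simplest ingredient of collocation finding: tokenise the text and count adjacent word pairs. This is not Mat's code and not what Mahout runs; it skips the statistical scoring a real collocation job would apply and just counts raw bigrams. The function names and sample sentences are my own.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_counts(documents):
    """Count adjacent token pairs within each document."""
    counts = Counter()
    for doc in documents:
        tokens = tokenise(doc)
        # zip pairs every token with its successor in the same document
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    docs = ["Common crawl is a web crawl", "The common crawl corpus"]
    for pair, count in bigram_counts(docs).most_common(3):
        print(pair, count)
```

Raw counts favour pairs of frequent words, which is exactly why serious collocation tools score pairs statistically instead.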

Web Data Commons

Tuesday, January 24th, 2012

Web Data Commons: Extracting Structured Data from the Common Web Crawl

From the post:

Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).

We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating from October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.

An interesting mining of open data.

The ability to perform comparisons on data over time is particularly interesting.

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Thursday, January 12th, 2012

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

From the post:

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Any takers?

It will be this weekend but I will be reporting back next Monday.
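In the meantime, for anyone who has not met the word counter the post riffs on, here is a minimal single-machine sketch of the map and reduce steps in Python. It is not the Common Crawl example code and it does not use Hadoop; it only shows the shape of the pattern. The function names and sample pages are my own.

```python
import re
from collections import defaultdict


def map_words(page_text):
    """Map step: emit a (word, 1) pair for every word on a page."""
    for word in re.findall(r"[a-z0-9']+", page_text.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(pages):
    """Run the map step over every page, then reduce the combined output."""
    pairs = (pair for page in pages for pair in map_words(page))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample_pages = ["Open data meets open source", "Open crawl of the web"]
    print(word_count(sample_pages))
```

On a cluster the map calls would run in parallel across pages and the framework would group the pairs by word before reducing; here a plain loop stands in for both.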

Common Crawl

Wednesday, November 30th, 2011

Common Crawl

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.

The current crawl is reported to be 5 billion pages. That should keep your hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching as well as processing seriously large data.