Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 7, 2014

Small Crawl

Filed under: Common Crawl, Data, Webcrawler, WWW — Patrick Durusau @ 7:40 pm

meanpath Jan 2014 Torrent – 1.6TB of crawl data from 115m websites

From the post:

October 2012 was the official kick-off date for development of meanpath – our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google level) financial investment. Beyond many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting block for our crawler. Enter Common Crawl, an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index, which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.

We are firm supporters of open access to information, which is why we have chosen to release a free crawl of over 115 million sites. This index contains only the front page HTML, robots.txt, favicons, and server headers of every crawlable .com, .net, .org, .biz, .info, .us, .mobi, and .xxx domain that was in the 2 January 2014 zone files. It does not execute or follow JavaScript or CSS, so it is not 100% equivalent to what you see when you click View Source in your browser. The crawl itself started at 2:00am UTC on 4 January 2014 and finished the same day.
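For anyone who wants to reproduce the master-list step on a smaller scale, here is a minimal Python sketch that merges several plain-text domain lists into one deduplicated seed set. The file names are hypothetical, and real zone files and the Common Crawl URL Index each come in their own formats that need parsing first.

```python
import gzip

def load_domains(path):
    """Yield normalized domain names from a one-domain-per-line list,
    gzipped or plain text."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            domain = line.strip().lower().rstrip(".")
            if domain and not domain.startswith("#"):
                yield domain

# Hypothetical inputs: pre-extracted domain lists, one per source.
sources = ["com-domains.txt", "net-domains.txt", "commoncrawl-domains.txt.gz"]

master = set()
for src in sources:
    master.update(load_domains(src))

print(f"{len(master):,} unique seed domains")
```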

Get Started:
You can access the meanpath January 2014 Front Page Index in two ways:

  1. BitTorrent – We have set up a number of seeds that you can download from using this descriptor. Please seed if you can afford the bandwidth, and make sure you have 1.6TB of free disk space if you plan to download the whole crawl.
  2. Web front end – If you are not interested in grappling with the raw crawl files, you can use our web front end to do some sample searches.
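If you take the torrent route, you can sample a downloaded shard as a stream instead of decompressing 1.5TB to disk. A minimal Python sketch, assuming the shards are plain gzip files of text records (the file name is hypothetical; check the torrent descriptor for the actual layout):

```python
import gzip

shard = "meanpath-2014-01-shard-000.gz"  # hypothetical file name

# Stream-decompress and scan a sample of records without
# writing the uncompressed data to disk.
with gzip.open(shard, "rt", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        if "generator" in line.lower():  # toy source-code search
            print(line.rstrip()[:120])
        if i >= 1_000_000:               # stop after a sample
            break
```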

Data Set Statistics:

  1. 149,369,860 seed domains. We started our crawl with a full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682) and .xxx (111,809) top level domains (TLDs), for a total of 149,369,860 domains. We have a much larger set of domains covering all TLDs, but very few allow you to download a zone file from the registrar, so we cannot guarantee 100% coverage. For statistical purposes, having a defined 100% starting point is necessary.
  2. 115,642,924 successfully crawled domains. Of the 149,369,860 seed domains, only 115,642,924 could be crawled, a coverage rate of 77.42%.
  3. 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl, which was done in 5 passes. If a domain could not be crawled in the first pass, we tried 4 more passes before giving up (domains excluded by robots.txt are not retried) [see the sketch after this list]. The most common reason a domain cannot be crawled is the lack of any valid A record for domain.com or www.domain.com.
  4. 1,500GB of uncompressed data. This has been compressed down to 352.40GB using gzip for ease of download.
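Items 2 and 3 describe a retry-pass scheme: resolve an A record for domain.com or www.domain.com, skip (and never retry) anything robots.txt excludes, and give up after five passes. This is not meanpath's code, just a minimal standard-library Python sketch of that scheme for a single domain:

```python
import socket
import urllib.robotparser
import urllib.request

def has_a_record(domain):
    """True if domain or www.domain resolves -- the most common
    failure the post mentions is the lack of any valid A record."""
    for host in (domain, "www." + domain):
        try:
            socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
            return True
        except socket.gaierror:
            continue
    return False

def crawl_front_page(domain, passes=5, timeout=10):
    """Fetch the front page, retrying up to `passes` times; domains
    excluded by robots.txt are not retried."""
    rp = urllib.robotparser.RobotFileParser(f"http://{domain}/robots.txt")
    try:
        rp.read()
        if not rp.can_fetch("*", f"http://{domain}/"):
            return None              # excluded by robots.txt: no retries
    except OSError:
        pass                         # unreadable robots.txt: carry on
    for _ in range(passes):
        if not has_a_record(domain):
            continue                 # no A record; try again next pass
        try:
            with urllib.request.urlopen(f"http://{domain}/",
                                        timeout=timeout) as resp:
                return resp.read()
        except OSError:
            continue                 # network error; try again next pass
    return None                      # gave up after `passes` attempts
```

At meanpath's scale this loop would of course run massively in parallel; the point here is only the pass and exclusion logic.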

I just scanned the Net for 2TB hard drives and the average price runs between $80 and $100. There doesn’t seem to be much difference between internal and external drives.

The only issue I foresee is that some ISPs limit downloads. You can always tunnel to another box using SSH, but that requires enough storage on the other box as well.
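If you do stage the download on a box with more disk, you can stream shards back over SSH and process them locally without keeping a second full copy. A minimal sketch, with a hypothetical host and path:

```python
import gzip
import subprocess

# Read a remote shard over an SSH pipe and decompress it on the fly;
# only the remote box needs the full 1.6TB. Host and path are hypothetical.
proc = subprocess.Popen(
    ["ssh", "user@bigdisk.example.com", "cat crawl/shard-000.gz"],
    stdout=subprocess.PIPE,
)

with gzip.GzipFile(fileobj=proc.stdout) as f:
    for raw in f:                    # each record arrives as bytes
        line = raw.decode("utf-8", errors="replace")
        # ...process the record here...

proc.wait()
```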

Be sure to check out meanpath’s search capabilities.

Perhaps the day of boutique search engines is getting closer!
