Have you ever wanted to crawl the WWW? To make a really comprehensive search? Waiting for a private power facility and server farm? You need wait no longer!
In WikiReverse data pipeline details, Ross Fairbanks describes the creation of WikiReverse:
WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites.
You can browse the data at WikiReverse or download it from S3 as a torrent.
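If you want to poke at the data locally, something along these lines should work. This is a minimal sketch, assuming the download unpacks to a gzipped CSV; the file name wikireverse.csv.gz and the column order (language, article title, linking page URL) are my guesses, not anything from Ross' post:

```python
# A minimal sketch of peeking at the WikiReverse download.
# ASSUMPTIONS (mine, not from the post): the data unpacks to a gzipped CSV
# called "wikireverse.csv.gz" whose rows are (language, article_title,
# linking_page_url); check the actual download for the real layout.
import csv
import gzip
from itertools import islice

def iter_links(path="wikireverse.csv.gz"):
    """Yield rows from the dump as (language, article_title, linking_page_url)."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        yield from csv.reader(f)

if __name__ == "__main__":
    for row in islice(iter_links(), 5):   # peek at the first five rows
        print(row)
```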
The first thought that struck me was that the data set would be useful for deciding which Wikipedia links are the default subject identifiers for particular subjects.
My second thought was what a wonderful starting place this would be for finding links with similar content strings, to create topics with multiple subject identifiers.
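To make both thoughts concrete, here is a rough sketch, not anything from Ross' pipeline: it counts inbound links per article as a crude vote for which Wikipedia URL should serve as the default subject identifier, and groups articles whose normalized titles match so a single topic can carry several candidate identifiers. It reuses the hypothetical iter_links() helper from the sketch above, and the URL prefix and normalization rule are likewise my own assumptions:

```python
# A rough sketch of turning the link data into topic-map raw material.
# Assumes iter_links() from the previous sketch is in scope; the URL prefix
# and the title-normalization rule are illustrative assumptions only.
from collections import Counter, defaultdict

WIKI = "https://en.wikipedia.org/wiki/"   # assumes English-language articles

def normalize(title):
    """Crude grouping key so 'Topic_Maps' and 'topic maps' fall together."""
    return title.replace("_", " ").strip().lower()

inbound = Counter()               # article title -> number of linking pages
identifiers = defaultdict(set)    # normalized title -> candidate identifier URLs

for language, article, _page_url in iter_links():
    if language != "en":
        continue
    inbound[article] += 1
    identifiers[normalize(article)].add(WIKI + article)

# Most-linked articles: a crude vote for the "default" subject identifier.
for article, n in inbound.most_common(10):
    print(f"{n:8d}  {WIKI}{article}")

# Groups that end up with more than one candidate subject identifier.
for key, urls in identifiers.items():
    if len(urls) > 1:
        print(key, "->", sorted(urls))
```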
My third thought was, $64 to search a CommonCrawl data set!
You can run a lot of searches at $64 each before you approach the cost of a server farm, much less a server farm plus a private power facility.
True, it won’t be interactive, but then few searches at the NSA are probably interactive either.
The true upside is that you are freed from the tyranny of PageRank and hidden algorithms by which vendors attempt to guess what is best for them and, secondarily, what is best for you.
Take the time to work through Ross’ post and develop your skills with the CommonCrawl data.