Extracting topics of interests using blog data in Amazon’s Common-Crawl corpus
From the post:
This project aims at profiling blogger interests correlated with their demographics. Amazon’s Common-Crawl corpus was used for this purpose. The crawled data corresponding to the blogger profile web-pages(Sample page) was used as the dataset for this analysis.
The selective download of the required dataset was made possible by the Common Crawl URL Index by Scott Robertson. About 8000 blogger profile pages(surprisingly low!) were found in the corpus using the URL index. Part of the reason for this low number is that the URL index at this time has been generated only for the half of 81TB amazon corpus.
Check out the project at GitHub
If you don’t know already, Common Crawl has a URL index to ease your use of the data set.