Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 1, 2013

Blogging about Bloggers

Filed under: Common Crawl — Patrick Durusau @ 4:10 pm

Extracting topics of interests using blog data in Amazon’s Common-Crawl corpus

From the post:

This project aims at profiling blogger interests correlated with their demographics. Amazon’s Common-Crawl corpus was used for this purpose. The crawled data corresponding to the blogger profile web pages (sample page) was used as the dataset for this analysis.

The selective download of the required dataset was made possible by the Common Crawl URL Index by Scott Robertson. About 8000 blogger profile pages (surprisingly low!) were found in the corpus using the URL index. Part of the reason for this low number is that the URL index at this time has been generated for only half of the 81 TB Amazon corpus.

Check out the project at GitHub

If you don’t know already, Common Crawl has a URL index to ease your use of the data set.
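
To make the "selective download" idea concrete, here is a minimal sketch of the filtering step only: given URLs that some index lookup has already returned, keep the ones that look like Blogger profile pages. This is not the project's code and it does not talk to the Common Crawl URL index. The function names are hypothetical, and the URL shape (`www.blogger.com/profile/<digits>`) is my assumption about what a profile page address looks like.

```python
import re
from urllib.parse import urlsplit

# Assumption: a Blogger profile page has a path of the form /profile/<digits>.
PROFILE_PATH = re.compile(r"^/profile/\d+/?$")

# Assumption: profile pages are served from one of these hosts.
PROFILE_HOSTS = {"blogger.com", "www.blogger.com"}


def is_blogger_profile(url: str) -> bool:
    """Return True if the URL looks like a Blogger profile page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return host in PROFILE_HOSTS and bool(PROFILE_PATH.match(parts.path))


def select_profiles(urls):
    """Keep only the URLs that look like Blogger profile pages (hypothetical helper)."""
    return [u for u in urls if is_blogger_profile(u)]


if __name__ == "__main__":
    # Made-up URLs standing in for the results of an index lookup.
    candidates = [
        "http://www.blogger.com/profile/01234567890123456789",
        "http://www.blogger.com/about",
        "http://example.com/profile/42",
    ]
    for url in select_profiles(candidates):
        print(url)
```

Filtering on the URL alone is the point: you decide what to fetch before pulling any page content out of the corpus.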
