Common Crawl To Add New Data In Amazon Web Services Bucket
From the post:
The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.
That’s good news!
At least I think so.
I am sure that, like everyone else, I will be trying to find the cycles (or at least thinking about it) to play with (sorry, explore) the Common Crawl data set.
I hesitate to say without reservation that this is a good thing, because my data needs are more modest than searching the entire WWW.
That wasn’t so hard to say. Hurt a little but not that much. 😉
I am exploring how to get better focus on information resources of interest to me. I rather doubt that focus is going to start with the entire WWW as an information space. Will keep you posted.