Archive for the ‘Common Crawl’ Category

Blogging about Bloggers

Wednesday, May 1st, 2013

Extracting topics of interests using blog data in Amazon’s Common-Crawl corpus

From the post:

This project aims at profiling blogger interests correlated with their demographics. Amazon’s Common-Crawl corpus was used for this purpose. The crawled data corresponding to the blogger profile web-pages(Sample page) was used as the dataset for this analysis.

The selective download of the required dataset was made possible by the Common Crawl URL Index by Scott Robertson. About 8000 blogger profile pages(surprisingly low!) were found in the corpus using the URL index. Part of the reason for this low number is that the URL index at this time has been generated only for the half of 81TB amazon corpus.

Check out the project at GitHub

If you don’t know already, Common Crawl has a URL index to ease your use of the data set.

URL Search Tool!

Wednesday, March 6th, 2013

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it take to run your jobs since you can now run them across only on the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!

…NCSU Library URLs in the Common Crawl Index

Sunday, January 20th, 2013

Analysis of the NCSU Library URLs in the Common Crawl Index by Lisa Green.

From the post:

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl Index to look at NCSU Library URLs in the Common Crawl Index. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

A great starting point for using the Common Crawl Index!

Common Crawl URL Index

Thursday, January 10th, 2013

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster , would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

What research project would you want to do first?

blekko donates search data to Common Crawl [uncommon knowledge graphs]

Monday, December 17th, 2012

blekko donates search data to Common Crawl by Lisa Green.

From the post:

I am very excited to announce that blekko is donating search data to Common Crawl!

blekko was founded in 2007 to pursue innovations that would eliminate spam in search results. blekko has created a new type of search experience that enlists human editors in its efforts to eliminate spam and personalize search. blekko has raised $55 million in VC and currently has 48 employees, including former Google and Yahoo! Search engineers.

For details of their donation and collaboration with Common Crawl see the post from their blog below. Follow blekko on Twitter and subscribe to their blog to keep abreast of their news (lots of cool stuff going on over there!) and be sure to check out there search.

And from blekko:

At blekko, we believe the web and search should be open and transparent — it’s number one in the blekko Bill of Rights. To make web data accessible, blekko gives away our search results to innovative applications using our API. Today, we’re happy to announce the ongoing donation of our search engine ranking metadata for 140 million websites and 22 billion webpages to the Common Crawl Foundation.

That’s a fair sized chunk of metadata.

The advantage of having large scale crawling and storage capabilities is slowly fading.

Are you ready to take the next step beyond tweaking the same approach?

Yes, Google has the Knowledge Graph. Which is no mean achievement.

On the other hand, aren’t most enterprises interested in uncommon knowledge graphs? As in their knowledge graph?

The difference between a freebie calculator and a Sun workstation.

Which one do you want?

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Friday, December 14th, 2012

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!

Access:

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date January-June 2012
Total Data 40.1 Terabyte (compressed)
Parsed HTML URLs 3,005,629,093
URLs with Triples 369,254,196
Domains in Crawl 40,600,000
Domains with Triples 2,286,277
Typed Entities 1,811,471,956
Triples 7,350,953,995

See also:

Web Data Commons Extraction Report – August 2012 Corpus

and,

Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

Where the authors report:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.

PLDs = Pay-Level-Domains.

The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.

Or to put it differently, if RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?

I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.

Towards Social Discovery…

Sunday, November 4th, 2012

Towards Social Discovery – New Content Models; New Data; New Toolsets by Matthew Berk, Founder of Lucky Oyster.

From the post:

When I first came across the field of information retrieval in the 80′s and early 90′s (back when TREC began), vectors were all the rage, and the key units were terms, texts, and corpora. Through the 90′s and with the advent of hypertext and later the explosion of the Web, that metaphor shifted to pages, sites, and links, and approaches like HITS and Page Rank leveraged hyperlinking between documents and sites as key proxies for authority and relevance.

Today we’re at a crossroads, as the nature of the content we seek to leverage through search and discovery has shifted once again, with a specific gravity now defined by entities, structured metadata, and (social) connections. In particular, and based on my work with Common Crawl data specifically, content has shifted in three critical ways:

No, I won’t even summarize his three points. It’s short and quite well written.

Read his post and then consider: Where do topic maps fit into his “crossroads?”

A Common Crawl Experiment

Saturday, October 27th, 2012

A Common Crawl Experiment by Pavel Repin.

An introduction to the Common Crawl project.

Starts you off slow, with 4 billion pages. ;-)

You are limited only by your imagination.

I first saw this at: Pete Warden’s Five Short Links.

finding names in common crawl

Sunday, August 19th, 2012

finding names in common crawl by Mat Kelcey.

From the post:

the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!

If you don’t know “common crawl,” now would be a good time to meet the project.

From their webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

Mat gets you started by looking for names in the common crawl data set.

Learn Hadoop and get a paper published

Thursday, May 10th, 2012

Learn Hadoop and get a paper published by Allison Domicone.

From the post:

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedbly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable papers for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.

How very cool!

Hurry, there are nineteen (19) comments already!

A Twelve Step Program for Searching the Internet

Sunday, March 25th, 2012

OK, the real title is: Twelve steps to running your Ruby code across five billion web pages

From the post:

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

A call to action and an awesome post!

If you have ever forwarded a blog post, forward this one.

This would make a great short course topic. Will have to give that some thought.

Web Data Commons

Thursday, March 22nd, 2012

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdatas and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of million web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing as it frees most of us from both crawling the web and/or extracting structured data from it. Or at least giving us the basis for more pointed crawling of the web.

A very welcome development!

Common Crawl To Add New Data In Amazon Web Services Bucket

Tuesday, March 13th, 2012

Common Crawl To Add New Data In Amazon Web Services Bucket

From the post:

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

That’s good news!

At least I think so.

I am sure like everyone else, I will be trying to find the cycles (or at least thinking about it) to play (sorry, explore) the Common Crawl data set.

I hesitate to say without reservation this is a good thing because my data needs are more modest than searching the entire WWW.

That wasn’t so hard to say. Hurt a little but not that much. ;-)

I am exploring how to get better focus on information resources of interest to me. I rather doubt that focus is going to start with the entire WWW as an information space. Will keep you posted.

running mahout collocations over common crawl text

Tuesday, March 6th, 2012

running mahout collocations over common crawl text by Mat Kelcey.

From the post:

Common crawl is a publically available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenised the visible text of the web pages in this dataset. All the code to do this is on github.

Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)

Web Data Commons

Tuesday, January 24th, 2012

Web Data Commons: Extracting Structured Data from the Common Web Crawl

From the post:

Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).

We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.

An interesting mining of open data.

The ability to perform comparisons on data over time is particularly interesting.

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Thursday, January 12th, 2012

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

From the post:

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Any takers?

It will be this weekend but I will be reporting back next Monday.

Common Crawl

Wednesday, November 30th, 2011

Common Crawl

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.

Current crawl is reported to be 5 billion pages. That should keep you hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching as well as processing serious sized data.