Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 19, 2012

finding names in common crawl

Filed under: Common Crawl,Natural Language Processing — Patrick Durusau @ 1:34 pm

finding names in common crawl by Mat Kelcey.

From the post:

the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!

If you don’t know Common Crawl, now would be a good time to meet the project.

From their webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

Mat gets you started by looking for names in the Common Crawl data set.
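To give a rough feel for the task, here is a minimal sketch of pulling candidate names out of already-extracted visible text. It is not Mat's method: the regular expression (runs of two or three capitalized words), the `candidate_names` helper and the sample string are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical heuristic: two or three consecutive capitalized words.
# This is an illustrative assumption, not the approach used in Mat's post.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}\b")


def candidate_names(text):
    """Count capitalized word sequences that might be names."""
    return Counter(CANDIDATE.findall(text))


# Made-up sample standing in for the visible text of a crawled page.
sample = (
    "Common Crawl publishes an open crawl of the web. "
    "Mat Kelcey wrote about it, and Mat Kelcey looked for names."
)

counts = candidate_names(sample)
for name, n in counts.most_common():
    print(n, name)
```

A heuristic this crude also picks up "Common Crawl" as a candidate, which shows why real name finding usually calls for a proper named entity recognizer rather than a pattern match.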
