finding names in common crawl by Mat Kelcey.
From the post:
the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!
If you don’t know Common Crawl, now would be a good time to meet the project.
From their webpage:
Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.
Mat gets you started by looking for names in the Common Crawl data set.
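For a rough sense of what "looking for names" in extracted text can involve, here is a minimal sketch in Python. It is not Mat's method: the capitalized-word heuristic, the names `NAME_PATTERN` and `find_candidate_names`, and the sample sentence are all assumptions made up for illustration. A real pipeline would more likely use a proper named-entity recognizer.

```python
import re
from collections import Counter

# Hypothetical heuristic (an assumption, not from the post): treat two or
# three consecutive capitalized words as a candidate name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}\b")


def find_candidate_names(text):
    """Count capitalized word sequences that might be names."""
    return Counter(NAME_PATTERN.findall(text))


# Made-up sample standing in for the visible text of a crawled page.
sample = (
    "Mat Kelcey wrote about common crawl. "
    "Later, Mat Kelcey searched it for names."
)
counts = find_candidate_names(sample)
for name, n in counts.most_common():
    print(name, n)
```

A heuristic this crude will also pick up place names, product names, and title-case headings, so treat its output as candidates to filter rather than as answers.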