Scrapy and Elasticsearch by Florian Hopf.
From the post:
On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch, the slides are here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.
Web Crawling
You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.
The Example
In this post we are looking at a rather artificial example: Crawling the meetup.com page for recent meetups to make them available for search. Why artificial? Because meetup.com has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well. (emphasis in original)
Not everything you need to know about Scrapy but enough to get you interested.
APIs for data are on the up swing but web scrapers will be relevant to data mining for decades to come.